path: root/fs/btrfs
Age  Commit message  Author
2025-11-24btrfs: reduce block group critical section in btrfs_add_reserved_bytes()Filipe Manana
We are doing some things inside the block group's critical section that are relevant only to the space_info: updating the space_info counters bytes_reserved and bytes_may_use as well as trying to grant tickets (calling btrfs_try_granting_tickets()), and this latter can take some time. So move all those updates to outside the block group's critical section and still inside the space_info's critical section. Like this we keep the block group's critical section only for block group updates and can help reduce contention on a block group's lock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: reduce block group critical section in btrfs_free_reserved_bytes()Filipe Manana
There's no need to update the space_info fields (bytes_reserved, max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the block group's spinlock. So move those updates to happen after we unlock the block group (and while holding the space_info locked of course), so that all we do under the block group's critical section is to update the block group itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
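The lock nesting described in the two commits above can be sketched in user space as follows. This is only an illustration under stated assumptions: the struct and function names are hypothetical, pthread mutexes stand in for the kernel spinlocks, and the real functions update more fields and also try to grant tickets.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the btrfs structures. */
struct space_info_sketch {
	pthread_mutex_t lock;
	uint64_t bytes_reserved;
	uint64_t bytes_may_use;
};

struct block_group_sketch {
	pthread_mutex_t lock;
	struct space_info_sketch *space_info;
	uint64_t reserved;
};

/*
 * Reserve num_bytes from a block group. The block group lock covers only
 * the block group's own field; the space_info counters are updated after
 * dropping it, while the outer space_info lock is still held.
 */
static void add_reserved_bytes_sketch(struct block_group_sketch *bg,
				      uint64_t num_bytes)
{
	struct space_info_sketch *sinfo = bg->space_info;

	pthread_mutex_lock(&sinfo->lock);

	pthread_mutex_lock(&bg->lock);
	bg->reserved += num_bytes;		/* block group state only */
	pthread_mutex_unlock(&bg->lock);

	/* Inner lock released, outer lock still held. */
	assert(sinfo->bytes_may_use >= num_bytes);
	sinfo->bytes_reserved += num_bytes;
	sinfo->bytes_may_use -= num_bytes;

	pthread_mutex_unlock(&sinfo->lock);
}
```

The point of the ordering is that other tasks waiting on the block group lock are not held up by work that only concerns the space_info.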
2025-11-24btrfs: reduce space_info critical section in btrfs_chunk_alloc()Filipe Manana
There's no need to update local variables while holding the space_info's spinlock, since the update isn't using anything from the space_info. So move these updates outside the critical section to shorten it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove double underscore prefix from __reserve_bytes()Filipe Manana
The use of a double underscore prefix is discouraged and we have no justification at all for it since there's no reserve_bytes() counterpart. So remove the prefix. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: process ticket outside global reserve critical sectionFilipe Manana
In steal_from_global_rsv() there's no need to process the ticket inside the critical section of the global reserve. Move the ticket processing to happen after the critical section. This helps reduce contention on the global reserve's spinlock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: assign booleans to global reserve's full fieldFilipe Manana
We have a couple places that are assigning 0 and 1 to the full field of the global reserve. This is harmless since 0 is converted to false and 1 converted to true, but for better readability, replace these with true and false since the field is of type bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: assert space_info is locked in steal_from_global_rsv()Filipe Manana
The caller is supposed to have locked the space_info, so assert that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid unnecessary reclaim calculation in priority_reclaim_metadata_space()Filipe Manana
If the given ticket was already served (its ->bytes is 0), then we wasted time calculating the metadata reclaim size. So calculate it only after we checked the ticket was not yet served. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: shorten critical section in btrfs_preempt_reclaim_metadata_space()Filipe Manana
We are doing a lot of small calculations and assignments while holding the space_info's spinlock, which is a heavily used lock for space reservation and flushing. There's no point in holding the lock for so long when all we want is to call need_preemptive_reclaim() and get a consistent value for a couple of counters from the space_info. Instead, grab the counters into local variables, release the lock and then use the local variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
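The "grab the counters into local variables, release the lock and then use the local variables" approach from the commit above might look roughly like this. The names and the reclaim heuristic are hypothetical placeholders (the real calculation is more involved), and a pthread mutex stands in for the space_info spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in holding two counters protected by a lock. */
struct reclaim_info_sketch {
	pthread_mutex_t lock;
	uint64_t bytes_may_use;
	uint64_t total_bytes;
};

/*
 * Take a consistent snapshot of the counters under the lock, then do all
 * the arithmetic on the local copies with the lock already released.
 */
static uint64_t reclaim_target_sketch(struct reclaim_info_sketch *info)
{
	uint64_t may_use;
	uint64_t total;

	pthread_mutex_lock(&info->lock);
	may_use = info->bytes_may_use;
	total = info->total_bytes;
	pthread_mutex_unlock(&info->lock);

	assert(total > 0);

	/* Placeholder heuristic: reclaim whatever exceeds half the total. */
	if (may_use <= total / 2)
		return 0;
	return may_use - total / 2;
}
```

Both counters are read within one critical section, so they are consistent with each other even though the computation happens afterwards.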
2025-11-24btrfs: increment loop count outside critical section during metadata reclaimFilipe Manana
In btrfs_preempt_reclaim_metadata_space() there's no need to increment the local variable that tracks the number of iterations of the while loop while inside the critical section delimited by the space_info's spinlock. That spinlock is heavily used by space reservation and flushing code, so it's desirable to have its critical sections as short as possible. So move the loop count increment outside the critical section. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: bail out earlier from need_preemptive_reclaim() if we have ticketsFilipe Manana
Instead of doing some calculations and then return false if it turns out we have queued tickets, check first if we have tickets and return false immediately if we have tickets, without wasting time on doing those computations. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: inline btrfs_space_info_used()Filipe Manana
The function is simple enough to be inlined and in fact doing it even reduces the object code. In x86_64 with gcc 14.2.0-19 from Debian the results were the following:

Before this change

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1919410  161703   15592 2096705  1ffe41 fs/btrfs/btrfs.ko

After this change

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1918991  161675   15592 2096258  1ffc82 fs/btrfs/btrfs.ko

Also remove the ASSERT() that checks the space_info argument is not NULL, as it's odd to be there since it can never be NULL and in case that ever happens during development, a stack trace from a NULL pointer dereference will be obvious. It was originally added when btrfs_space_info_used() was introduced in commit 4136135b080f ("Btrfs: use helper to get used bytes of space_info"). Also add a lockdep assertion to check the space_info's lock is being held by the calling task.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid used space computation when reserving spaceFilipe Manana
In __reserve_bytes() we have 3 repeated calls to btrfs_space_info_used(), one early on as soon as we take the space_info's spinlock, another one when we call btrfs_can_overcommit(), which calls btrfs_space_info_used() again, and a final one when we are reserving for a flush emergency. During all these calls we are holding the space_info's spinlock, which is heavily used by the space reservation and flushing code, so it's desirable to make the critical sections as short as possible. So make this more efficient by: 1) Instead of calling btrfs_can_overcommit() call the new variant can_overcommit() which takes the space_info's used space as an argument and pass the value we already computed and have in the 'used' variable; 2) Instead of calling btrfs_space_info_used() with its second argument as false when we are doing a flush emergency, decrement the space_info's bytes_may_use counter from the 'used' variable, as the difference between passing true or false as the second argument to btrfs_space_info_used() is whether or not to include the space_info's bytes_may_use counter in the computation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
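A small user-space sketch of the two points above: passing an already computed 'used' value to a checking helper, and deriving the "without bytes_may_use" value by subtraction instead of a second summation. All names are hypothetical; the counters bytes_used and bytes_pinned are assumed here for illustration (the commit messages only name bytes_reserved, bytes_may_use, bytes_readonly and bytes_zone_unusable), and the fit check is a plain capacity test, not the real overcommit logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counter set; the field list is an assumption. */
struct used_counters_sketch {
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_zone_unusable;
	uint64_t bytes_may_use;
	uint64_t total_bytes;
};

/* Sum the counters, optionally including bytes_may_use. */
static uint64_t space_used_sketch(const struct used_counters_sketch *c,
				  bool may_use_included)
{
	uint64_t used = c->bytes_used + c->bytes_reserved + c->bytes_pinned +
			c->bytes_readonly + c->bytes_zone_unusable;

	return may_use_included ? used + c->bytes_may_use : used;
}

/* Variant that takes 'used' from the caller instead of recomputing it. */
static bool fits_with_used_sketch(const struct used_counters_sketch *c,
				  uint64_t used, uint64_t bytes)
{
	return used + bytes <= c->total_bytes;
}

/* Derive the value without bytes_may_use from the one that includes it. */
static uint64_t used_without_may_use_sketch(const struct used_counters_sketch *c,
					    uint64_t used_with_may_use)
{
	assert(used_with_may_use >= c->bytes_may_use);
	return used_with_may_use - c->bytes_may_use;
}
```

A caller would compute `used = space_used_sketch(c, true)` once and then hand that value to both helpers for the rest of the critical section.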
2025-11-24btrfs: avoid used space computation when trying to grant ticketsFilipe Manana
In btrfs_try_granting_tickets(), we call btrfs_can_overcommit() and that calls btrfs_space_info_used(). But we already keep track, in the 'used' local variable, of the used space in the space_info, so we are just repeating the same computation and doing an extra function call while we are holding the space_info's spinlock, which is heavily used by the space reservation and flushing code. So add a local variant of btrfs_can_overcommit() that takes in the used space as an argument and therefore does not call btrfs_space_info_used(), and use it in btrfs_try_granting_tickets(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: make btrfs_can_overcommit() return bool instead of intFilipe Manana
It's a boolean function, so switch its return type to bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid recomputing used space in btrfs_try_granting_tickets()Filipe Manana
In every iteration of the loop we call btrfs_space_info_used() which sums a bunch of fields from a space_info object. This implies doing a function call besides the sum, and we are holding the space_info's spinlock while we do this, so we want to keep the critical section as short as possible since that spinlock is used in all the code for space reservation and flushing (therefore it's heavily used). So call btrfs_space_info_used() only once, before entering the loop, and then update the result as we remove tickets. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
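The idea of computing the used space once and keeping a running total while tickets are granted can be sketched as below. This is a simplified model with hypothetical names: tickets live in an array rather than a list, granting a ticket is assumed to add its size to bytes_may_use, and the overcommit path is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ticket: a pending request for 'bytes' of space. */
struct ticket_sketch {
	uint64_t bytes;
};

/*
 * Grant tickets in FIFO order until one does not fit. 'used' is computed
 * once before the loop and then kept in sync by hand, instead of summing
 * all counters again in every iteration. Returns the number granted.
 */
static size_t grant_tickets_sketch(struct ticket_sketch *tickets, size_t count,
				   uint64_t *bytes_may_use, uint64_t other_used,
				   uint64_t total_bytes)
{
	uint64_t used = other_used + *bytes_may_use;	/* computed once */
	size_t granted = 0;

	assert(tickets != NULL || count == 0);

	while (granted < count) {
		uint64_t need = tickets[granted].bytes;

		if (used + need > total_bytes)
			break;

		*bytes_may_use += need;
		used += need;			/* keep the running total in sync */
		tickets[granted].bytes = 0;	/* mark the ticket as served */
		granted++;
	}
	return granted;
}
```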
2025-11-24btrfs: return real error when failing tickets in maybe_fail_all_tickets()Filipe Manana
In case we had a transaction abort we set a ticket's error to -EIO, but we have the real error that caused the transaction to be aborted returned by the macro BTRFS_FS_ERROR(). So use that real error instead of -EIO. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: subpage: simplify the PAGECACHE_TAG_TOWRITE handlingQu Wenruo
In function btrfs_subpage_set_writeback() we need to keep the PAGECACHE_TAG_TOWRITE tag if the folio is still dirty. This is a needed quirk to support async extents, as a subpage range can almost suddenly go writeback, without touching other subpage ranges in the same folio. However we can simplify the handling by replacing the open-coded tag clearing with passing the @keep_write flag depending on whether the folio is dirty. Since we're holding the subpage lock already, no one is able to change the dirty/writeback flag, thus it's safe to check the folio dirty before calling __folio_start_writeback(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove pointless data_end assignment in btrfs_extent_item()Filipe Manana
There's no point in setting 'data_end' to 'old_data' as we don't use it afterwards. So remove the redundant assignment, which was never needed and was added when the function was first added in commit 6567e837df07 ("Btrfs: early work to file_write in big extents"). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use the key format macros when printing keysFilipe Manana
Change all locations that print a key to use the new macros to print them in order to ensure a consistent style and avoid repetitive code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: add macros to facilitate printing of keysFilipe Manana
There's a lot of places where we need to print a key, and it's tiresome to type the format specifier, typically "(%llu %u %llu)", as well as passing 3 arguments to a printk family function (key->objectid, key->type, key->offset). So add a couple macros for this just like we have for csum values in btrfs_inode.h (CSUM_FMT and CSUM_FMT_VALUE). This also ensures that we consistently print a key in the same format, always as "(%llu %llu %llu)", which is the most common format we use, but we have a few variations such as "[%llu %llu %llu]" for no good reason. This patch introduces the macros while the next one makes use of it. This is to ease backports of future patches, since then we can backport this patch which is simple and short and then backport those future patches, as the next patch in the series that makes use of these new macros is quite large and may have some dependencies. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
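A format/value macro pair of the kind described above could look like the following user-space sketch. The macro and struct names are hypothetical (the commit message does not give the names that were added), snprintf() stands in for the printk family, and the "(%llu %u %llu)" specifier quoted in the message is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a btrfs key. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Format specifier and matching argument list, defined in one place. */
#define KEY_FMT_SKETCH "(%llu %u %llu)"
#define KEY_FMT_VALUE_SKETCH(key)			\
	(unsigned long long)(key)->objectid,		\
	(unsigned int)(key)->type,			\
	(unsigned long long)(key)->offset

/* Format a key into buf; returns what snprintf() returns. */
static int format_key_sketch(char *buf, size_t size,
			     const struct key_sketch *key)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "key " KEY_FMT_SKETCH,
			KEY_FMT_VALUE_SKETCH(key));
}
```

Because the specifier is a string literal, it concatenates with the surrounding message at compile time, so every call site prints keys the same way without retyping the three arguments.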
2025-11-24btrfs: remove redundant refcount check in btrfs_put_transaction()Xuanqiang Luo
Eric Dumazet removed the redundant refcount check for sk_refcnt, and I noticed a similar issue in btrfs_put_transaction(). refcount_dec_and_test() already checks for a zero refcount and complains, making the preceding WARN_ON redundant. This is a leftover from the atomic_t times. Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_zoned_activate_one_bg()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_sysfs_add_space_info_type()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: more trivial BTRFS_PATH_AUTO_FREE conversionsSun YangKai
Convert more of the trivial pattern for the auto freeing of btrfs_path with goto -> return conversions where applicable. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
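The "goto -> return" conversion mentioned above relies on scope-based cleanup. The sketch below shows the general technique with the GCC/Clang `cleanup` variable attribute; the macro, struct and function names are hypothetical, and it is an assumption here that the kernel helper is built on the same compiler feature.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a path object that must be freed. */
struct path_sketch {
	int slots[8];
};

static int paths_freed_sketch;	/* counts frees, for illustration only */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void free_path_sketch(struct path_sketch **pathp)
{
	assert(pathp != NULL);
	if (*pathp) {
		free(*pathp);
		paths_freed_sketch++;
	}
}

/* Declare a path pointer that is freed when it goes out of scope. */
#define PATH_AUTO_FREE_SKETCH(name) \
	struct path_sketch *name __attribute__((cleanup(free_path_sketch))) = NULL

static int lookup_sketch(int objectid)
{
	PATH_AUTO_FREE_SKETCH(path);

	path = calloc(1, sizeof(*path));
	if (!path)
		return -ENOMEM;
	if (objectid < 0)
		return -EINVAL;	/* plain return, no 'goto out' label needed */

	path->slots[0] = objectid;
	return path->slots[0];
}
```

Every return path, early or not, runs the handler, which is what makes the error-label boilerplate unnecessary.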
2025-11-24btrfs: remove fs_info argument from btrfs_reserve_metadata_bytes()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from __reserve_bytes()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: fix parameter documentation for btrfs_reserve_data_bytes()Filipe Manana
We don't have a fs_info argument anymore since commit 5d39fda880be ("btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()"), it was replaced by a space_info argument. So update the documentation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from maybe_clamp_preempt()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from handle_reserve_ticket()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from steal_from_global_rsv()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from need_preemptive_reclaim()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_calc_reclaim_metadata_size()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from shrink_delalloc() and flush_space()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_dump_space_info()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_can_overcommit()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from calc_available_free_space()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from maybe_fail_all_tickets()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from priority_reclaim_metadata_space()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from priority_reclaim_data_space()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_try_granting_tickets()Filipe Manana
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid repeated computations in btrfs_mark_ordered_io_finished()Filipe Manana
We're computing a few values several times:

1) The current ordered extent's end offset inside the while loop, we have computed it and stored it in the 'entry_end' variable but then we compute it again later as the first argument to the min() macro;

2) The end file offset, open coded 3 times;

3) The current length (stored in variable 'len') computed 2 times, one inside an assertion and the other when assigning to the 'len' variable.

So use existing variables and add new ones to prevent repeating these expressions and reduce the source code. We were also subtracting one from the result of min() macro call and then adding 1 back in the next line, making both operations pointless. So just remove the decrement and increment by 1. This also reduces very slightly the object code.

Before:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1916576  161679   15592 2093847  1ff317 fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1916556  161679   15592 2093827  1ff303 fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid multiple i_size rounding in btrfs_truncate()Filipe Manana
We have the inode locked so no one can concurrently change its i_size and neither do we change it ourselves, so there's no point in repeatedly rounding it in the while loop and setting it up in the control structure. That only causes confusion when reading the code. So move all the i_size setup and rounding out of the loop and assert the inode is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: consistently round up or down i_size in btrfs_truncate()Filipe Manana
We're using different ways to round down the i_size by sector size, one with a bitwise and with a negated mask and another with ALIGN_DOWN(), and using ALIGN() to round up. Replace these uses with the round_down() and round_up() macros which have names that make it clear the direction of the rounding (unlike the ALIGN() macro) and getting rid of the bitwise and, negated mask and local variable for the mask. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
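For reference, rounding to a power-of-two boundary such as the sector size can be written with explicitly named helpers as in this user-space sketch. The function names are hypothetical and these are plain functions, not the kernel macros themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Round value down to a multiple of align (align must be a power of two). */
static uint64_t round_down_pow2_sketch(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return value & ~(align - 1);
}

/* Round value up to a multiple of align (align must be a power of two). */
static uint64_t round_up_pow2_sketch(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value + align - 1) & ~(align - 1);
}
```

With a 4096 byte sector size, an i_size of 610969 rounds down to 610304 and up to 614400, the values that appear in the truncate scenario described in the last commit on this page.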
2025-11-24btrfs: add unlikely to unexpected error case in extent_writepages()Filipe Manana
We don't expect to hit errors and log the error message, so add the unlikely annotation to make it clear and to hint the compiler that it may reorganize code to be more efficient. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: split assertion into two in extent_writepage_io()Filipe Manana
If the assertion fails we don't get to know which of the two expressions failed nor the values used in each expression. So split the assertion into two, each for a single expression, so that if any is triggered we see a line number reported in a stack trace that points to which expression failed. Also make the assertions use the verbose mode to print the values involved in the computations. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use variable for end offset in extent_writepage_io()Filipe Manana
Instead of repeating the expression "start + len" multiple times, store it in a variable and use it where needed. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: truncate ordered extent when skipping writeback past i_sizeFilipe Manana
While running test case btrfs/192 from fstests with support for large folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic btrfs check failures reporting that csum items were missing. Looking into the issue it turned out that btrfs check searches for csum items of a file extent item with a range that spans beyond the i_size of a file and we don't have any, because the kernel's writeback code skips submitting bios for ranges beyond eof. It's not expected however to find a file extent item that crosses the rounded up (by the sector size) i_size value, but there is a short time window where we can end up with a transaction commit leaving this small inconsistency between the i_size and the last file extent item. Example btrfs check output when this happens: $ btrfs check /dev/sdc Opening filesystem to check... Checking filesystem on /dev/sdc UUID: 69642c61-5efb-4367-aa31-cdfd4067f713 [1/8] checking log skipped (none written) [2/8] checking root items [3/8] checking extents [4/8] checking free space tree [5/8] checking fs roots root 5 inode 332 errors 1000, some csum missing ERROR: errors found in fs roots (...) Looking at a tree dump of the fs tree (root 5) for inode 332 we have: $ btrfs inspect-internal dump-tree -t 5 /dev/sdc (...) item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160 generation 17 transid 19 size 610969 nbytes 86016 block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0 sequence 11 flags 0x0(none) atime 1759851068.391327881 (2025-10-07 16:31:08) ctime 1759851068.410098267 (2025-10-07 16:31:08) mtime 1759851068.410098267 (2025-10-07 16:31:08) otime 1759851068.391327881 (2025-10-07 16:31:08) item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13 index 2 namelen 3 name: f1f item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53 generation 19 type 1 (regular) extent data disk byte 21745664 nr 65536 extent data offset 0 nr 65536 ram 65536 extent compression 0 (none) (...) 
We can see that the file extent item for file offset 589824 has a length of 64K and its number of bytes is 64K. Looking at the inode item we see that its i_size is 610969 bytes which falls within the range of that file extent item [589824, 655360[. Looking into the csum tree: $ btrfs inspect-internal dump-tree /dev/sdc (...) item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200 range start 21565440 end 21770240 length 204800 item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8 range start 1104576512 end 1104584704 length 8192 (..) We see that the csum item number 15 covers the first 24K of the file extent item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664, so we have: 21770240 - 21745664 = 24K We see that the next csum item (number 16) is completely outside the range, so the remaining 40K of the extent doesn't have csum items in the tree. If we round up the i_size to the sector size, we get: round_up(610969, 4096) = 614400 If we subtract from that the file offset for the extent item we get: 614400 - 589824 = 24K So the missing 40K corresponds to the end of the file extent item's range minus the rounded up i_size: 655360 - 614400 = 40K Normally we don't expect a file extent item to span over the rounded up i_size of an inode, since when truncating, doing hole punching and other operations that trim a file extent item, the number of bytes is adjusted. There is however a short time window where the kernel can end up, temporarily,persisting an inode with an i_size that falls in the middle of the last file extent item and the file extent item was not yet trimmed (its number of bytes reduced so that it doesn't cross i_size rounded up by the sector size). The steps (in the kernel) that lead to such scenario are the following: 1) We have inode I as an empty file, no allocated extents, i_size is 0; 2) A buffered write is done for file range [589824, 655360[ (length of 64K) and the i_size is updated to 655360. 
    Note that we got a single large folio for the range (64K);

 3) A truncate operation starts that reduces the inode's i_size down to 610969 bytes. The truncate sets the inode's new i_size at btrfs_setsize() by calling truncate_setsize(), before calling btrfs_truncate();

 4) At btrfs_truncate() we trigger writeback for the range starting at 610304 (which is the new i_size rounded down to the sector size) and ending at (u64)-1;

 5) During the writeback, at extent_write_cache_pages(), the call to filemap_get_folios_tag() returns the 64K folio that starts at file offset 589824, since it contains the start offset of the writeback range (610304);

 6) At writepage_delalloc() we find the whole range of the folio is dirty and therefore we run delalloc for that 64K range ([589824, 655360[), reserving a 64K extent, creating an ordered extent, etc;

 7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[ because the inode's i_size is 610969 bytes (rounded up by sector size is 614400). There, in the while loop we intentionally skip IO beyond i_size to avoid any unnecessary work and just call btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which has a 40K length);

 8) Once the IO finishes we finish the ordered extent by ending up at btrfs_finish_one_ordered(), join transaction N, insert a file extent item in the inode's subvolume tree for file offset 589824 with a number of bytes of 64K, and update the inode's delayed inode item or directly the inode item with a call to btrfs_update_inode_fallback(), which results in storing the new i_size of 610969 bytes;

 9) Transaction N is committed either by the transaction kthread or by some other task (in response to a sync or fsync for example).
    At this point we have inode I persisted with an i_size of 610969 bytes and a file extent item that starts at file offset 589824 and has a number of bytes of 64K, ending at an offset of 655360 which is beyond the i_size rounded up to the sector size (614400).

    --> So after a crash or power failure here, the btrfs check program reports that error about missing checksum items for this inode, as it tries to look up checksums covering the whole range of the extent;

10) Only after transaction N is committed does the call to btrfs_start_transaction() at btrfs_truncate() start a new transaction, N + 1, instead of joining transaction N. And it's with transaction N + 1 that it calls btrfs_truncate_inode_items() which updates the file extent item at file offset 589824 to reduce its number of bytes from 64K down to 24K, so that the file extent item's range ends at the i_size rounded up to the sector size (614400 bytes).

Fix this by truncating the ordered extent at extent_writepage_io() when we skip writeback because the current offset in the folio is beyond i_size. This ensures we don't ever persist a file extent item with a number of bytes beyond the rounded up (by sector size) value of the i_size.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
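The byte arithmetic in the message above can be re-derived with a short script. This is only an illustrative sketch in Python, not kernel or btrfs-progs code: the helper names (round_up, round_down, trimmed_extent_len) are made up for the example, and the numbers are copied from the tree dumps quoted in the message, assuming a 4096 byte sector size.

```python
# Illustrative re-computation of the numbers quoted in the commit message.
# Helper names are hypothetical; values come from the quoted tree dumps.
SECTOR_SIZE = 4096  # assumed sector size, matching round_up(610969, 4096)
K = 1024


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def round_down(value, align):
    """Round value down to the previous multiple of align."""
    return value // align * align


def trimmed_extent_len(extent_start, extent_len, i_size):
    """Length the file extent item should have so that it does not
    cross i_size rounded up to the sector size."""
    limit = round_up(i_size, SECTOR_SIZE)
    return max(0, min(extent_start + extent_len, limit) - extent_start)


i_size = 610969          # inode item "size"
extent_start = 589824    # EXTENT_DATA key offset
extent_len = 65536       # "extent data offset 0 nr 65536"
disk_bytenr = 21745664   # "extent data disk byte"
csum_end = 21770240      # end of csum item 15

extent_end = extent_start + extent_len
aligned_i_size = round_up(i_size, SECTOR_SIZE)
csum_covered = csum_end - disk_bytenr
missing = extent_end - aligned_i_size
writeback_start = round_down(i_size, SECTOR_SIZE)

print("aligned i_size:", aligned_i_size)
print("csum covered:", csum_covered // K, "K")
print("missing csums:", missing // K, "K")
print("writeback start:", writeback_start)
print("trimmed length:", trimmed_extent_len(extent_start, extent_len, i_size) // K, "K")
```

The trimmed length of 24K is the number of bytes that step 10 leaves in the file extent item, and it equals the range that csum item 15 covers.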
2025-11-24  btrfs: implement remove_bdev and shutdown super operation callbacks  Qu Wenruo
For the ->remove_bdev() callback, btrfs will:

- Mark the target device as missing
- Go degraded if the fs can afford it
- Return an error otherwise
  This falls back to the shutdown callback.

For the ->shutdown() callback, btrfs will:

- Set the SHUTDOWN flag
  This rejects all new incoming operations and makes all writeback fail. The behavior is the same as the NOLOGFLUSH behavior.

To support the lookup from a bdev to a btrfs_device, btrfs_dev_lookup_args is enhanced with a new @devt member. If set, that @devt member should be enough to uniquely locate a btrfs device.

I know the shutdown can be a little overkill: with RAID1 metadata and RAID0 data, one can still read data with a 50% chance of getting good data. But a filesystem returning -EIO half of the time is not really considered usable. Further, it can also be as bad as the only device going missing on a single device btrfs.

So here we go with better safe than sorry when handling a missing device.

And the remove_bdev callback will be hidden behind experimental features for now. The reasons are:

- There are not enough btrfs specific bdev removal test cases
  The existing test cases all remove the only device, and thus only exercise the ->shutdown() behavior.

- The expected behavior is not yet determined
  Although the current auto-degrade behavior is no worse than the old behavior, it may not always be what the end users want. Before there is a concrete interface, better hide the new feature from end users.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Tested-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
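The decision flow described in the message above can be modelled roughly as follows. This is a toy Python sketch, not the kernel implementation: ToyFs, on_remove_bdev() and on_shutdown() are hypothetical names, and the assumption that the weakest profile in use decides how many missing devices "the fs can afford" is made for illustration only, since the message does not spell out that check.

```python
# Toy model of the ->remove_bdev() / ->shutdown() flow described above.
# All names are hypothetical; this is not how the kernel code is structured.
from dataclasses import dataclass


@dataclass
class ToyFs:
    # Assumption for this sketch: the least redundant profile in use
    # decides how many missing devices can be tolerated.
    tolerated_missing: int
    missing: int = 0
    degraded: bool = False
    shutdown: bool = False


def on_shutdown(fs):
    """Model of ->shutdown(): set the SHUTDOWN flag."""
    fs.shutdown = True


def on_remove_bdev(fs):
    """Model of ->remove_bdev(): mark the device missing, go degraded if
    the fs can afford it, otherwise fall back to the shutdown path."""
    fs.missing += 1
    if fs.missing <= fs.tolerated_missing:
        fs.degraded = True
        return "degraded"
    on_shutdown(fs)
    return "shutdown"


def new_operation_allowed(fs):
    """Once SHUTDOWN is set, new incoming operations are rejected."""
    return not fs.shutdown


raid1_everywhere = ToyFs(tolerated_missing=1)
raid1_meta_raid0_data = ToyFs(tolerated_missing=min(1, 0))
single_device = ToyFs(tolerated_missing=0)

print(on_remove_bdev(raid1_everywhere))       # degraded
print(on_remove_bdev(raid1_meta_raid0_data))  # shutdown
print(on_remove_bdev(single_device))          # shutdown
```

The RAID1 metadata plus RAID0 data case ends in a shutdown in this model, which is the conservative choice the message argues for.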
2025-11-24  btrfs: implement shutdown ioctl  Qu Wenruo
The shutdown ioctl should follow the XFS one, which uses magic number 'X' and ioctl number 125, with a uint32 as flags.

For now btrfs doesn't distinguish the DEFAULT and LOGFLUSH flags (just like f2fs): both will freeze the fs first (which implies committing the current transaction), set the SHUTDOWN flag and finally thaw the fs.

For the NOLOGFLUSH flag, the freeze/thaw part is skipped, thus the current transaction is aborted.

The new shutdown ioctl is hidden behind experimental features for more testing.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Tested-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
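As a rough illustration of the interface described above, the sketch below builds the ioctl request number from magic 'X', number 125 and a 32-bit argument, and lists the steps the message describes for each flag. Several things here are assumptions rather than facts from the message: the read direction (_IOR) and the flag values 0, 1 and 2 mirror what the XFS header is believed to use, the bit layout is the generic Linux one (a few architectures differ), and every name in the code is hypothetical.

```python
# Sketch of the shutdown ioctl described above. Names, the _IOR direction
# and the flag values are assumptions for illustration (believed to mirror
# XFS), not taken from the commit message.
import fcntl
import struct

# Generic Linux ioctl encoding; some architectures use a different layout.
IOC_NRSHIFT, IOC_TYPESHIFT, IOC_SIZESHIFT, IOC_DIRSHIFT = 0, 8, 16, 30
IOC_READ = 2

FLAGS_DEFAULT = 0     # assumed value
FLAGS_LOGFLUSH = 1    # assumed value
FLAGS_NOLOGFLUSH = 2  # assumed value


def ior(magic, nr, size):
    """Equivalent of the C _IOR() macro under the generic encoding."""
    return ((IOC_READ << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (ord(magic) << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


SHUTDOWN_REQUEST = ior("X", 125, struct.calcsize("=I"))


def shutdown_steps(flags):
    """Steps the message describes for each flag value."""
    if flags == FLAGS_NOLOGFLUSH:
        # No freeze/thaw, so the current transaction is aborted.
        return ["set SHUTDOWN flag"]
    # DEFAULT and LOGFLUSH are not distinguished for now.
    return ["freeze fs (commits current transaction)",
            "set SHUTDOWN flag", "thaw fs"]


def shut_down(fd, flags=FLAGS_DEFAULT):
    """Issue the ioctl on an open fd of the mounted filesystem.
    Destructive: only meant for a scratch fs with experimental features."""
    fcntl.ioctl(fd, SHUTDOWN_REQUEST, struct.pack("=I", flags))


print(hex(SHUTDOWN_REQUEST))
print(shutdown_steps(FLAGS_NOLOGFLUSH))
```

shut_down() is deliberately not called in the example, since it would make the target filesystem reject all further operations.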