summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)Author
7 daysMerge tag 'for-6.19-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "Features: - shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now): - set filesystem state as being shut down (also named going down in other filesystems), where all active operations return EIO and this cannot be changed until unmount - pending operations are attempted to be finished but error messages may still show up depending on where exactly the shutdown happened - scrub (and device replace) vs suspend/hibernate: - a running scrub will prevent suspend, which can be annoying as suspend is an immediate request and scrub is not critical - filesystem freezing before suspend was not sufficient as the problem was in process freezing - behaviour change: on suspend scrub and device replace are cancelled, where scrub can record the last state and continue from there; the device replace has to be restarted from the beginning - zone stats exported in sysfs, from the perspective of the filesystem this includes active, reclaimable, relocation etc zones Performance: - improvements when processing space reservation tickets by optimizing locking and shrinking critical sections, cumulative improvements in lockstat numbers show +15% Notable fixes: - use vmalloc fallback when allocating bios as high order allocations can happen with wide checksums (like sha256) - scrub will always track the last position of progress so it's not starting from zero after an error Core: - under experimental config, checksum calculations are offloaded to process context, simplifies locking and allows to remove compression write worker kthread(s): - speed improvement in direct IO throughput with buffered IO fallback is +15% when not offloaded but this is more related to internal crypto subsystem improvements - this will be probably default in the future removing the sysfs tunable - (experimental) block size > page size updates: - support more operations when not using large folios (encoded read/write and send) - raid56 - more preparations for fscrypt support Other: - more conversions to auto-cleaned variables - parameter cleanups and removals - extended warning fixes - improved printing of structured values like keys - lots of other cleanups and refactoring" * tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits) btrfs: remove unnecessary inode key in btrfs_log_all_parents() btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root() btrfs: remaining BTRFS_PATH_AUTO_FREE conversions btrfs: send: do not allocate memory for xattr data when checking it exists btrfs: send: add unlikely to all unexpected overflow checks btrfs: reduce arguments to btrfs_del_inode_ref_in_log() btrfs: remove root argument from btrfs_del_dir_entries_in_log() btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref() btrfs: don't search back for dir inode item in INO_LOOKUP_USER btrfs: don't rewrite ret from inode_permission btrfs: add orig_logical to btrfs_bio for encryption btrfs: disable verity on encrypted inodes btrfs: disable various operations on encrypted inodes btrfs: remove redundant level reset in btrfs_del_items() btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf() btrfs: optimize balance_level() path reference handling btrfs: factor out root promotion logic into promote_child_to_root() btrfs: raid56: remove the "_step" infix btrfs: raid56: enable bs > ps support btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases ...
9 daysMerge tag 'vfs-6.19-rc1.fs_header' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fs header updates from Christian Brauner: "This contains initial work to start splitting up fs.h. Begin the long-overdue work of splitting up the monolithic fs.h header. The header has grown to over 3000 lines and includes types and functions for many different subsystems, making it difficult to navigate and causing excessive compilation dependencies. This series introduces new focused headers for superblock-related code: - Rename fs_types.h to fs_dirent.h to better reflect its actual content (directory entry types) - Add fs/super_types.h containing superblock type definitions - Add fs/super.h containing superblock function declarations This is the first step in a longer effort to modularize the VFS headers. Cleanups: - Inode Field Layout Optimization (Mateusz Guzik) Move inode fields used during fast path lookup closer together to improve cache locality during path resolution. - current_umask() Optimization (Mateusz Guzik) Inline current_umask() and move it to fs_struct.h. This improves performance by avoiding function call overhead for this frequently-used function, and places it in a more appropriate header since it operates on fs_struct" * tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: move inode fields used during fast path lookup closer together fs: inline current_umask() and move it to fs_struct.h fs: add fs/super.h header fs: add fs/super_types.h header fs: rename fs_types.h to fs_dirent.h
9 daysMerge tag 'vfs-6.19-rc1.folio' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull folio updates from Christian Brauner: "Add a new folio_next_pos() helper function that returns the file position of the first byte after the current folio. This is a common operation in filesystems when needing to know the end of the current folio. The helper is lifted from btrfs which already had its own version, and is now used across multiple filesystems and subsystems: - btrfs - buffer - ext4 - f2fs - gfs2 - iomap - netfs - xfs - mm This fixes a long-standing bug in ocfs2 on 32-bit systems with files larger than 2GiB. Presumably this is not a common configuration, but the fix is backported anyway. The other filesystems did not have bugs, they were just mildly inefficient. This also introduce uoff_t as the unsigned version of loff_t. A recent commit inadvertently changed a comparison from being unsigned (on 64-bit systems) to being signed (which it had always been on 32-bit systems), leading to sporadic fstests failures. Generally file sizes are restricted to being a signed integer, but in places where -1 is passed to indicate "up to the end of the file", it is convenient to have an unsigned type to ensure comparisons are always unsigned regardless of architecture" * tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Add uoff_t mm: Use folio_next_pos() xfs: Use folio_next_pos() netfs: Use folio_next_pos() iomap: Use folio_next_pos() gfs2: Use folio_next_pos() f2fs: Use folio_next_pos() ext4: Use folio_next_pos() buffer: Use folio_next_pos() btrfs: Use folio_next_pos() filemap: Add folio_next_pos()
9 daysMerge tag 'vfs-6.19-rc1.writeback' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull writeback updates from Christian Brauner: "Features: - Allow file systems to increase the minimum writeback chunk size. The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. This adds a superblock field that allows the file system to override the default size, and sets it to the zone size for zoned XFS. - Add logging for slow writeback when it exceeds sysctl_hung_task_timeout_secs. This helps identify tasks waiting for a long time and pinpoint potential issues. Recording the starting jiffies is also useful when debugging a crashed vmcore. - Wake up waiting tasks when finishing the writeback of a chunk Cleanups: - filemap_* writeback interface cleanups. Adding filemap_fdatawrite_wbc ended up being a mistake, as all but the original btrfs caller should be using better high level interfaces instead. This series removes all these low-level interfaces, switches btrfs to a more specific interface, and cleans up other too low-level interfaces. With this the writeback_control that is passed to the writeback code is only initialized in three places. - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and filemap_fdatawrite_wbc - Add filemap_flush_nr helper for btrfs - Push struct writeback_control into start_delalloc_inodes in btrfs - Rename filemap_fdatawrite_range_kick to filemap_flush_range - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm - Make wbc_to_tag() inline and use it in fs" * tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Make wbc_to_tag() inline and use it in fs. xfs: set s_min_writeback_pages for zoned file systems writeback: allow the file system to override MIN_WRITEBACK_PAGES writeback: cleanup writeback_chunk_size mm: rename filemap_fdatawrite_range_kick to filemap_flush_range mm: remove __filemap_fdatawrite_range mm: remove filemap_fdatawrite_wbc mm: remove __filemap_fdatawrite mm,btrfs: add a filemap_flush_nr helper btrfs: push struct writeback_control into start_delalloc_inodes btrfs: use the local tmp_inode variable in start_delalloc_inodes ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers 9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs) writeback: Wake up waiting tasks when finishing the writeback of a chunk.
9 daysMerge tag 'vfs-6.19-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "Features: - Hide inode->i_state behind accessors. Open-coded accesses prevent asserting they are done correctly. One obvious aspect is locking, but significantly more can be checked. For example it can be detected when the code is clearing flags which are already missing, or is setting flags when it is illegal (e.g., I_FREEING when ->i_count > 0) - Provide accessors for ->i_state, converts all filesystems using coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2, overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to compile - Rework I_NEW handling to operate without fences, simplifying the code after the accessor infrastructure is in place Cleanups: - Move wait_on_inode() from writeback.h to fs.h - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb for clarity - Cosmetic fixes to LRU handling - Push list presence check into inode_io_list_del() - Touch up predicts in __d_lookup_rcu() - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage - Assert on ->i_count in iput_final() - Assert ->i_lock held in __iget() Fixes: - Add missing fences to I_NEW handling" * tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) dcache: touch up predicts in __d_lookup_rcu() fs: push list presence check into inode_io_list_del() fs: cosmetic fixes to lru handling fs: rework I_NEW handling to operate without fences fs: make plain ->i_state access fail to compile xfs: use the new ->i_state accessors nilfs2: use the new ->i_state accessors overlayfs: use the new ->i_state accessors gfs2: use the new ->i_state accessors f2fs: use the new ->i_state accessors smb: use the new ->i_state accessors ceph: use the new ->i_state accessors btrfs: use the new ->i_state accessors Manual conversion to use ->i_state accessors of all places not covered by coccinelle Coccinelle-based conversion to use ->i_state accessors fs: provide accessors for ->i_state fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb fs: move wait_on_inode() from writeback.h to fs.h fs: add missing fences to I_NEW handling ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage ...
9 daysMerge tag 'vfs-6.19-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE permission checks during path lookup and adds the IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid expensive permission work. - Hide dentry_cache behind runtime const machinery. - Add German Maglione as virtiofs co-maintainer. Cleanups: - Tidy up and inline step_into() and walk_component() for improved code generation. - Re-enable IOCB_NOWAIT writes to files. This refactors file timestamp update logic, fixing a layering bypass in btrfs when updating timestamps on device files and improving FMODE_NOCMTIME handling in VFS now that nfsd started using it. - Path lookup optimizations extracting slowpaths into dedicated routines and adding branch prediction hints for mntput_no_expire(), fd_install(), lookup_slow(), and various other hot paths. - Enable clang's -fms-extensions flag, requiring a JFS rename to avoid conflicts. - Remove spurious exports in fs/file_attr.c. - Stop duplicating union pipe_index declaration. This depends on the shared kbuild branch that brings in -fms-extensions support which is merged into this branch. - Use MD5 library instead of crypto_shash in ecryptfs. - Use largest_zero_folio() in iomap_dio_zero(). - Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and initrd code. - Various typo fixes. Fixes: - Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs() call with wait == 1 to commit super blocks. The emergency sync path never passed this, leaving btrfs data uncommitted during emergency sync. - Use local kmap in watch_queue's post_one_notification(). - Add hint prints in sb_set_blocksize() for LBS dependency on THP" * tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) MAINTAINERS: add German Maglione as virtiofs co-maintainer fs: inline step_into() and walk_component() fs: tidy up step_into() & friends before inlining orangefs: use inode_update_timestamps directly btrfs: fix the comment on btrfs_update_time btrfs: use vfs_utimes to update file timestamps fs: export vfs_utimes fs: lift the FMODE_NOCMTIME check into file_update_time_flags fs: refactor file timestamp update logic include/linux/fs.h: trivial fix: regualr -> regular fs/splice.c: trivial fix: pipes -> pipe's fs: mark lookup_slow() as noinline fs: add predicts based on nd->depth fs: move mntput_no_expire() slowpath into a dedicated routine fs: remove spurious exports in fs/file_attr.c watch_queue: Use local kmap in post_one_notification() fs: touch up predicts in path lookup fs: move fd_install() slowpath into a dedicated routine and provide commentary fs: hide dentry_cache behind runtime const machinery fs: touch predicts in do_dentry_open() ...
2025-11-26btrfs: fix the comment on btrfs_update_timeChristoph Hellwig
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this is not a copy of the high-level file_update_time helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25btrfs: remaining BTRFS_PATH_AUTO_FREE conversionsDavid Sterba
Do the remaining btrfs_path conversion to the auto cleaning, this seems to be the last one. Most of the conversions are trivial, only adding the declaration and removing the freeing, or changing the goto patterns to return. There are some functions with many changes, like __btrfs_free_extent(), btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree() but it still follows the same pattern. Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25btrfs: reduce arguments to btrfs_del_inode_ref_in_log()Filipe Manana
Instead of passing a root and the objectid of the parent directory, just pass the directory inode, as like that we can extract both the root and the objectid, reducing the number of arguments by one. It also makes the function more consistent with other log tree functions in the sense that we pass the inode and not only its objectid. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25btrfs: remove root argument from btrfs_del_dir_entries_in_log()Filipe Manana
There's no need to pass the root as we can extract it from the directory inode, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-25btrfs: disable various operations on encrypted inodesOmar Sandoval
Initially, only normal data extents will be encrypted. This change forbids various other bits: - allows reflinking only if both inodes have the same encryption status - disable inline data on encrypted inodes Note: The patch was taken from v5 of fscrypt patchset (https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/) which was handled over time by various people: Omar Sandoval, Sweet Tea Dorminy, Josef Bacik. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use booleans for delalloc arguments and struct find_free_extent_ctlFilipe Manana
The struct find_free_extent_ctl uses an int for the 'delalloc' field but it's always used as a boolean, and its value is used to be passed to several functions to signal if we are dealing with delalloc. The same goes for the 'is_data' argument from btrfs_reserve_extent(). So change the type from int to bool and move the field definition in the find_free_extent_ctl structure so that it's close to other bool fields and reduces the size of the structure from 144 down to 136 bytes (at the moment it's only declared in the stack of btrfs_reserve_extent(), never allocated otherwise). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use bool type for btrfs_path members used as booleansFilipe Manana
Many fields of struct btrfs_path are used as booleans but their type is an unsigned int (of one 1 bit width to save space). Change the type to bool keeping the :1 suffix so that they combine with the previous u8 fields in order to save space. This makes the code more clear by using explicit true/false and more in line with the preferred style, preserving the size of the structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: make read verification handle bs > ps cases without large foliosQu Wenruo
The current read verification is also relying on large folios to support bs > ps cases, but that introduced quite some limits. To enhance read-repair to support bs > ps without large folios: - Make btrfs_data_csum_ok() to accept an array of paddrs Which can pass the paddrs[] direct into btrfs_calculate_block_csum_pages(). - Make repair_one_sector() to accept an array of paddrs So that it can submit a repair bio backed by regular pages, not only large folios. This requires us to allocate more slots at bio allocation time though. Also since the caller may have only partially advanced the saved_iter for bs > ps cases, we can not directly trust the logical bytenr from saved_iter (can be unaligned), thus a manual round down is necessary for the logical bytenr. - Make btrfs_check_read_bio() to build an array of paddrs The tricky part is that we can only call btrfs_data_csum_ok() after all involved pages are assembled. This means at the call time of btrfs_check_read_bio(), our offset inside the bio is already at the end of the fs block. Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and repair_one_sector(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: make btrfs_csum_one_bio() handle bs > ps without large foliosQu Wenruo
For bs > ps cases, all folios passed into btrfs_csum_one_bio() are ensured to be backed by large folios. But that requirement excludes features like direct IO and encoded writes. To support bs > ps without large folios, enhance btrfs_csum_one_bio() by: - Split btrfs_calculate_block_csum() into two versions * btrfs_calculate_block_csum_folio() For call sites where a fs block is always backed by a large folio. This will do extra checks on the folio size, build a paddrs[] array, and pass it into the newer btrfs_calculate_block_csum_pages() helper. For now btrfs_check_block_csum() is still using this version. * btrfs_calculate_block_csum_pages() For call sites that may hit a fs block backed by noncontiguous pages. The pages are represented by paddrs[] array, which includes the offset inside the page. This function will do the proper sub-block handling. - Make btrfs_csum_one_bio() to use btrfs_calculate_block_csum_pages() This means we will need to build a local paddrs[] array, and after filling a fs block, do the checksum calculation. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: move and rename CSUM_FMT definitionDavid Sterba
Move the CSUM_FMT* definitions to fs.h where is be the BTRFS_KEY_FMT and add the prefix for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking contextQu Wenruo
We used IRQ version of spinlock for ordered_tree_lock, as btrfs_finish_ordered_extent() can be called in end_bbio_data_write() which was in IRQ context. However since we're moving all the btrfs_bio::end_io() calls into task context, there is no more need to support IRQ context thus we can relax to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inodeQu Wenruo
Currently there is only one caller which doesn't populate btrfs_bio::inode, and that's scrub. The idea is scrub doesn't want any automatic csum verification nor read-repair, as everything will be handled by scrub itself. However that behavior is really no different than metadata inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub. The only exception is in btrfs_submit_chunk() where if a bbio is from scrub or data reloc inode, we set rst_search_commit_root to true. This means we still need a way to distinguish scrub from metadata, but that can be done by a new flag inside btrfs_bio. Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from btrfs_bio structure. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: headers cleanup to remove unnecessary local includesQu Wenruo
[BUG] When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to grab the fs_info, the header "btrfs_inode.h" is needed to access the full btrfs_inode structure. Then btrfs will fail to compile. [CAUSE] There is a recursive including chain: "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" -> "bio.h" That recursive including is causing problems for btrfs. [ENHANCEMENT] To reduce the risk of recursive including: - Remove unnecessary local includes from btrfs headers Either the included header is pulled in by other headers, or is completely unnecessary. - Remove btrfs local includes if the header only requires a pointer In that case let the implementing C file to pull the required header. This is especially important for headers like "btrfs_inode.h" which pulls in a lot of other btrfs headers, thus it's a mine field of recursive including. - Remove unnecessary temporary structure definition Either if we have included the header defining the structure, or completely unused. Now including "btrfs_inode.h" inside "bio.h" is completely fine, although "btrfs_inode.h" still includes "extent_map.h", but that header only includes "fs.h", no more chain back to "bio.h". Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: declare free_ipath() via DEFINE_FREE()Miquel Sabaté Solà
The free_ipath() function was being used as a cleanup function everywhere. Declare it via DEFINE_FREE() so we can use this function with the __free() helper. The name has also been adjusted so it's closer to the type's name. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use the key format macros when printing keysFilipe Manana
Change all locations that print a key to use the new macros to print them in order to ensure a consistent style and avoid repetitive code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: avoid multiple i_size rounding in btrfs_truncate()Filipe Manana
We have the inode locked so no one can concurrently change its i_size and neither do we change it ourselves, so there's no point in keep rounding it in the while loop and setting it up in the control structure. That only causes confusion when reading the code. So move all the i_size setup and rounding out of the loop and assert the inode is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: consistently round up or down i_size in btrfs_truncate()Filipe Manana
We're using different ways to round down the i_size by sector size, one with a bitwise and with a negated mask and another with ALIGN_DOWN(), and using ALIGN() to round up. Replace these uses with the round_down() and round_up() macros which have have names that make it clear the direction of the rounding (unlike the ALIGN() macro) and getting rid of the bitwise and, negated mask and local variable for the mask. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: introduce a new shutdown stateQu Wenruo
A new fs state EMERGENCY_SHUTDOWN is introduced, which is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN, after entering emergency shutdown state, all operations will return errors (-EIO), and can not be bring back to normal state until unmouont. The new state will reject the following file operations: - read_iter() - write_iter() - mmap() - open() - remap_file_range() - uring_cmd() - splice_read() This requires a small wrapper to do the extra shutdown check, then call the regular filemap_splice_read() function This should reject most of the file operations on a shutdown btrfs. And for the existing dirty folios, extra shutdown checks are introduced to the following functions: - run_delalloc_nocow() - run_delalloc_compressed() - cow_file_range() So that dirty ranges will still be properly cleaned without being submitted. Finally the shutdown state will also set the fs error, so that no new transaction will be committed, protecting the metadata from any possible further corruption. And when the fs entered shutdown mode for the first time, a critical level kernel message will show up to indicate the incident. That message will be important for end users as rejected delalloc ranges will output error messages, hopefully that shutdown message and the fact that all fs operations are returning error will prevent end users from getting too confused about the delalloc error messages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Tested-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: fix trivial -Wshadow warningsDavid Sterba
When compiling with -Wshadow (also in 'make W=2' build) there are several reports of shadowed variables that seem to be harmless: - btrfs_do_encoded_write() - we can reuse 'ordered', there's no previous value that would need to be preserved - scrub_write_endio() - we need a standalone 'i' for bio iteration - scrub_stripe() - duplicate ret2 for errors that must not overwrite 'ret' - btrfs_subpage_set_writeback() - 'flags' is used for another irqsave lock but is not overwritten when reused for xarray due to scoping, but for clarity let's rename it - process_dir_items_leaf() - duplicate 'ret', used only for immediate checks Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-12btrfs: utilize IOP_FASTPERM_MAY_EXECMateusz Guzik
Root filesystem was ext4, btrfs was mounted on /testfs. Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h on Sapphire Rapids (ops/s): before: 3447976 after: 3620879 (+5%) Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251107142149.989998-3-mjguzik@gmail.com Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11Merge tag 'for-6.18-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix new inode name tracking in tree-log - fix conventional zone and stripe calculations in zoned mode - fix bio reference counts on error paths in relocation and scrub * tag 'for-6.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: release root after error in data_reloc_print_warning_inode() btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe() btrfs: do not update last_log_commit when logging inode due to a new name btrfs: zoned: fix stripe width calculation btrfs: zoned: fix conventional zone capacity calculation
2025-11-05fs: inline current_umask() and move it to fs_struct.hMateusz Guzik
There is no good reason to have this as a func call, other than avoiding the churn of adding fs_struct.h as needed. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05btrfs: release root after error in data_reloc_print_warning_inode()Zilin Guan
data_reloc_print_warning_inode() calls btrfs_get_fs_root() to obtain local_root, but fails to release its reference when paths_from_inode() returns an error. This causes a potential memory leak. Add a missing btrfs_put_root() call in the error path to properly decrease the reference count of local_root. Fixes: b9a9a85059cde ("btrfs: output affected files when relocation fails") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-04Merge tag 'for-6.18-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix memory leak in qgroup relation ioctl when qgroup levels are invalid - don't write back dirty metadata on filesystem with errors - properly log renamed links - properly mark prealloc extent range beyond inode size as dirty (when no-noles is not enabled) * tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: mark dirty extent range for out of bound prealloc extents btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation btrfs: ensure no dirty metadata is written back for an fs with errors
2025-10-31btrfs: Use folio_next_pos()Matthew Wilcox (Oracle)
btrfs defined its own variant of folio_next_pos() called folio_end(). This is an ambiguous name as 'end' might be exclusive or inclusive. Switch to the new folio_next_pos(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new nameFilipe Manana
If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree. We are currently doing it when adding a new link but we are missing it when renaming. An example where this makes a new name not persisted: 1) create symlink with name foo in directory A 2) fsync directory A, which persists the symlink 3) rename the symlink from foo to bar 4) fsync directory A to persist the new symlink name Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo". This is because when we first fsync directoy A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directory to the 'log_extents' label. Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-29mm,btrfs: add a filemap_flush_nr helperChristoph Hellwig
Abstract out the btrfs-specific behavior of kicking off I/O on a number of pages on an address_space into a well-defined helper. Note: there is no kerneldoc comment for the new function because it is not part of the public API. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251024080431.324236-7-hch@lst.de Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29btrfs: push struct writeback_control into start_delalloc_inodesChristoph Hellwig
In preparation for changing the filemap_fdatawrite_wbc API to not expose the writeback_control to the callers, push the wbc declaration next to the filemap_fdatawrite_wbc call and just pass the nr_to_write value to start_delalloc_inodes. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251024080431.324236-6-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29btrfs: use the local tmp_inode variable in start_delalloc_inodesChristoph Hellwig
start_delalloc_inodes has a struct inode * pointer available in the main loop, use it instead of re-calculating it from the btrfs inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251024080431.324236-5-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20btrfs: use the new ->i_state accessorsMateusz Guzik
Change generated with coccinelle and fixed up by hand as appropriate. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-09-30Merge tag 'for-6.18-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There are no new features, the changes are in the core code, notably tree-log error handling and reporting improvements, and initial support for block size > page size. Performance improvements: - search data checksums in the commit root (previous transaction) to avoid locking contention, this improves parallelism of read heavy/low write workloads, and also reduces transaction commit time; on real and reproducer workload the sync time went from minutes to tens of seconds (workload and numbers are in the changelog) Core: - tree-log updates: - error handling improvements, transaction aborts - add new error state 'O' (printed in status messages) when log replay fails and is aborted - reduced number of btrfs_path allocations when traversing the tree - 'block size > page size' support - basic implementation with limitations, under experimental build - limitations: no direct io, raid56, encoded read (standalone and in send ioctl), encoded write - preparatory work for compression, removing implicit assumptions of page and block sizes - compression workspaces are now per-filesystem, we cannot assume common block size for work memory among different filesystems - tree-checker now verifies INODE_EXTREF item (which is implementing hardlinks) - tree leaf pretty printer updates, there were missing data from items, keys/items - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG, it's a debugging feature and not needed to be enabled separately - more struct btrfs_path auto free updates - use ref_tracker API for tracking delayed inodes, enabled by mount option 'ref_verify', allowing to better pinpoint leaking references - in zoned mode, avoid selecting data relocation zoned for ordinary data block groups - updated and enhanced error messages - lots of cleanups and refactoring" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits) btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() btrfs: add unlikely annotations to branches leading to transaction abort btrfs: add unlikely annotations to branches leading to EIO btrfs: add unlikely annotations to branches leading to EUCLEAN btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions btrfs: zoned: don't fail mount needlessly due to too many active zones btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() btrfs: enable experimental bs > ps support btrfs: add extra ASSERT()s to catch unaligned bios btrfs: fix symbolic link reading when bs > ps btrfs: prepare scrub to support bs > ps cases btrfs: prepare zlib to support bs > ps cases btrfs: prepare lzo to support bs > ps cases btrfs: prepare zstd to support bs > ps cases btrfs: prepare compression folio alloc/free for bs > ps cases btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() btrfs: remove pointless key offset setup in create_pending_snapshot() btrfs: annotate btrfs_is_testing() as unlikely and make it return bool btrfs: make the rule checking more readable for should_cow_block() btrfs: simplify inline extent end calculation at replay_one_extent() ...
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-23btrfs: add unlikely annotations to branches leading to transaction abortDavid Sterba
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EIODavid Sterba
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EUCLEANDavid Sterba
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix symbolic link reading when bs > psQu Wenruo
[BUG DURING BS > PS TEST] When running the following script on a btrfs whose block size is larger than page size, e.g. 8K block size and 4K page size, it will trigger a kernel BUG: # mkfs.btrfs -s 8k $dev # mount $dev $mnt # mkdir $mnt/dir # ln -s dir $mnt/link # ls $mnt/link The call trace looks like this: BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS info (device dm-2): checking UUID tree BTRFS info (device dm-2): enabling ssd optimizations BTRFS info (device dm-2): enabling free space tree ------------[ cut here ]------------ kernel BUG at /home/adam/linux/include/linux/highmem.h:275! Oops: invalid opcode: 0000 [#1] SMP CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs] Call Trace: <TASK> btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] filemap_read_folio+0x37/0xe0 do_read_cache_folio+0x94/0x3e0 __page_get_link.isra.0+0x20/0x90 page_get_link+0x16/0x40 step_into+0x69b/0x830 path_lookupat+0xa7/0x170 filename_lookup+0xf7/0x200 ? set_ptes.isra.0+0x36/0x70 vfs_statx+0x7a/0x160 do_statx+0x63/0xa0 __x64_sys_statx+0x90/0xe0 do_syscall_64+0x82/0xae0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Please note bs > ps support is still under development and the enablement patch is not even in btrfs development branch. [CAUSE] Btrfs reuses its data folio read path to handle symbolic links, as the symbolic link target is stored as an inline data extent. But for newly created inodes, btrfs only set the minimal order if the target inode is a regular file. Thus for above newly created symbolic link, it doesn't properly respect the minimal folio order, and triggered the above crash. [FIX] Call btrfs_set_inode_mapping_order() unconditionally inside btrfs_create_new_inode(). For symbolic links this will fix the crash as now the folio will meet the minimal order. For regular files this brings no change. For directory/bdev/char and all the other types of inodes, they won't go through the data read path, thus no effect either. Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare compression folio alloc/free for bs > ps casesQu Wenruo
This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: introduce btrfs_bio_for_each_block() helperQu Wenruo
Currently if we want to iterate a bio in block unit, we do something like this: while (iter->bi_size) { struct bio_vec bv = bio_iter_iovec(); /* Do something with using the bv */ bio_advance_iter_single(&bbio->bio, iter, sectorsize); } That's fine for now, but it will not handle future bs > ps, as bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not exceed page size. This means the code using that bv can only handle a block if bs <= ps. To address this problem and handle future bs > ps cases better: - Introduce a helper btrfs_bio_for_each_block() Instead of bio_vec, which has single and multiple page version and multiple page version has quite some limits, use my favorite way to represent a block, phys_addr_t. For bs <= ps cases, nothing is changed, except we will do a very small overhead to convert phys_addr_t to a folio, then use the proper folio helpers to handle the possible highmem cases. For bs > ps cases, all blocks will be backed by large folios, meaning every folio will cover at least one block. And still use proper folio helpers to handle highmem cases. With phys_addr_t, we will handle both large folio and highmem properly. So there is no better single variable to present a btrfs block than phys_addr_t. - Extract the data block csum calculation into a helper The new helper, btrfs_calculate_block_csum() will be utilized by btrfs_csum_one_bio(). - Use btrfs_bio_for_each_block() to replace existing call sites Including: * index_one_bio() from raid56.c Very straight-forward. * btrfs_check_read_bio() Also update repair_one_sector() to grab the folio using phys_addr_t, and do extra checks to make sure the folio covers at least one block. We do not need to bother bv_len at all now. * btrfs_csum_one_bio() Now we can move the highmem handling into a dedicated helper, calculate_block_csum(), and use btrfs_bio_for_each_block() helper. There is one exception in btrfs_decompress_buf2page(), which is copying decompressed data into the original bio, which is not iterating using block size thus we don't need to bother. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: concentrate highmem handling for data verificationQu Wenruo
Currently for btrfs checksum verification, we do it in the following pattern: kaddr = kmap_local_*(); ret = btrfs_check_csum_csum(kaddr); kunmap_local(kaddr); It's OK for now, but it's still not following the patterns of helpers inside linux/highmem.h, which never requires a virt memory address. In those highmem helpers, they mostly accept a folio, some offset/length inside the folio, and in the implementation they check if the folio needs partial kmap, and do the handling. Inspired by those formal highmem helpers, enhance the highmem handling of data checksum verification by: - Rename btrfs_check_sector_csum() to btrfs_check_block_csum() To follow the more common term "block" used in all other major filesystems. - Pass a physical address into btrfs_check_block_csum() and btrfs_data_csum_ok() The physical address is always available even for a highmem page. Since it's page frame number << PAGE_SHIFT + offset in page. And with that physical address, we can grab the folio covering the page, and do extra checks to ensure it covers at least one block. This also allows us to do the kmap inside btrfs_check_block_csum(). This means all the extra HIGHMEM handling will be concentrated into btrfs_check_block_csum(), and no callers will need to bother highmem by themselves. - Properly zero out the block if csum mismatch Since btrfs_data_csum_ok() only got a paddr, we can not and should not use memzero_bvec(), which only accepts single page bvec. Instead use paddr to grab the folio and call folio_zero_range() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix typos in comments and stringsDavid Sterba
Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: keep folios locked inside run_delalloc_nocow()Qu Wenruo
[BUG] There is a very low chance that DEBUG_WARN() inside btrfs_writepage_cow_fixup() can be triggered when CONFIG_BTRFS_EXPERIMENTAL is enabled. This only happens after run_delalloc_nocow() failed. Unfortunately I haven't hit it for a while thus no real world dmesg for now. [CAUSE] There is a race window where after run_delalloc_nocow() failed, error handling can race with writeback thread. Before we hit run_delalloc_nocow(), there is an inode with the following dirty pages: (4K page size, 4K block size, no large folio) 0 4K 8K 12K 16K |/////////|///////////|///////////|////////////| The inode also have NODATACOW flag, and the above dirty range will go through different extents during run_delalloc_range(): 0 4K 8K 12K 16K | NOCOW | COW | COW | NOCOW | The race happen like this: writeback thread A | writeback thread B ----------------------------------+-------------------------------------- Writeback for folio 0 | run_delalloc_nocow() | |- nocow_one_range() | | For range [0, 4K), ret = 0 | | | |- fallback_to_cow() | | For range [4K, 8K), ret = 0 | | Folio 4K *UNLOCKED* | | | Writeback for folio 4K |- fallback_to_cow() | extent_writepage() | For range [8K, 12K), failure | |- writepage_delalloc() | | | |- btrfs_cleanup_ordered_extents()| | |- btrfs_folio_clear_ordered() | | | Folio 0 still locked, safe | | | | | Ordered extent already allocated. | | | Nothing to do. | | |- extent_writepage_io() | | |- btrfs_writepage_cow_fixup() |- btrfs_folio_clear_ordered() | | Folio 4K hold by thread B, | | UNSAFE! | |- btrfs_test_ordered() | | Cleared by thread A, | | | |- DEBUG_WARN(); This is only possible after run_delalloc_nocow() failure, as cow_file_range() will keep all folios and io tree range locked, until everything is finished or after error handling. The root cause is we allow fallback_to_cow() and nocow_one_range() to unlock the folios after a successful run, so that during error handling we're no longer safe to use btrfs_cleanup_ordered_extents() as the folios are already unlocked. [FIX] - Make fallback_to_cow() and nocow_one_range() to keep folios locked after a successful run For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag into cow_file_range(). For nocow_one_range() we have to remove the PAGE_UNLOCK flag from extent_clear_unlock_delalloc(). - Unlock folios if everything is fine in run_delalloc_nocow() - Use extent_clear_unlock_delalloc() to handle range [@start, @cur_offset) inside run_delalloc_nocow() Since folios are still locked, we do not need cleanup_dirty_folios() to do the cleanup. extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK | PAGE_END_WRITEBACK" will clear the dirty flags. - Remove cleanup_dirty_folios() Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: make nocow_one_range() to do cleanup on errorQu Wenruo
Currently if we hit an error inside nocow_one_range(), we do not clear the page dirty, and let the caller to handle it. This is very different compared to fallback_to_cow(), when that function failed, everything will be cleaned up by cow_file_range(). Enhance the situation by: - Use a common error handling for nocow_one_range() If we failed anything, use the same btrfs_cleanup_ordered_extents() and extent_clear_unlock_delalloc(). btrfs_cleanup_ordered_extents() is safe even if we haven't created new ordered extent, in that case there should be no OE and that function will do nothing. The same applies to extent_clear_unlock_delalloc(), and since we're passing PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK, it will also clear folio dirty flag during error handling. - Avoid touching the failed range of nocow_one_range() As the failed range will be cleaned up and unlocked by that function. Here we introduce a new variable @nocow_end to record the failed range, so that we can skip it during the error handling of run_delalloc_nocow(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>