| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, simplifies locking and allows to remove
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will be probably default in the future removing the sysfs
tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduce uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Cheaper MAY_EXEC handling for path lookup. This elides MAY_WRITE
permission checks during path lookup and adds the
IOP_FASTPERM_MAY_EXEC flag so filesystems like btrfs can avoid
expensive permission work.
- Hide dentry_cache behind runtime const machinery.
- Add German Maglione as virtiofs co-maintainer.
Cleanups:
- Tidy up and inline step_into() and walk_component() for improved
code generation.
- Re-enable IOCB_NOWAIT writes to files. This refactors file
timestamp update logic, fixing a layering bypass in btrfs when
updating timestamps on device files and improving FMODE_NOCMTIME
handling in VFS now that nfsd started using it.
- Path lookup optimizations extracting slowpaths into dedicated
routines and adding branch prediction hints for mntput_no_expire(),
fd_install(), lookup_slow(), and various other hot paths.
- Enable clang's -fms-extensions flag, requiring a JFS rename to
avoid conflicts.
- Remove spurious exports in fs/file_attr.c.
- Stop duplicating union pipe_index declaration. This depends on the
shared kbuild branch that brings in -fms-extensions support which
is merged into this branch.
- Use MD5 library instead of crypto_shash in ecryptfs.
- Use largest_zero_folio() in iomap_dio_zero().
- Replace simple_strtol/strtoul with kstrtoint/kstrtouint in init and
initrd code.
- Various typo fixes.
Fixes:
- Fix emergency sync for btrfs. Btrfs requires an explicit sync_fs()
call with wait == 1 to commit super blocks. The emergency sync path
never passed this, leaving btrfs data uncommitted during emergency
sync.
- Use local kmap in watch_queue's post_one_notification().
- Add hint prints in sb_set_blocksize() for LBS dependency on THP"
* tag 'vfs-6.19-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
MAINTAINERS: add German Maglione as virtiofs co-maintainer
fs: inline step_into() and walk_component()
fs: tidy up step_into() & friends before inlining
orangefs: use inode_update_timestamps directly
btrfs: fix the comment on btrfs_update_time
btrfs: use vfs_utimes to update file timestamps
fs: export vfs_utimes
fs: lift the FMODE_NOCMTIME check into file_update_time_flags
fs: refactor file timestamp update logic
include/linux/fs.h: trivial fix: regualr -> regular
fs/splice.c: trivial fix: pipes -> pipe's
fs: mark lookup_slow() as noinline
fs: add predicts based on nd->depth
fs: move mntput_no_expire() slowpath into a dedicated routine
fs: remove spurious exports in fs/file_attr.c
watch_queue: Use local kmap in post_one_notification()
fs: touch up predicts in path lookup
fs: move fd_install() slowpath into a dedicated routine and provide commentary
fs: hide dentry_cache behind runtime const machinery
fs: touch predicts in do_dentry_open()
...
|
|
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Do the remaining btrfs_path conversion to the auto cleaning, this seems
to be the last one. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.
There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of passing a root and the objectid of the parent directory, just
pass the directory inode, as like that we can extract both the root and
the objectid, reducing the number of arguments by one. It also makes the
function more consistent with other log tree functions in the sense that
we pass the inode and not only its objectid.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to pass the root as we can extract it from the directory
inode, so remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Initially, only normal data extents will be encrypted. This change
forbids various other bits:
- allows reflinking only if both inodes have the same encryption status
- disable inline data on encrypted inodes
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The struct find_free_extent_ctl uses an int for the 'delalloc' field but
it's always used as a boolean, and its value is used to be passed to
several functions to signal if we are dealing with delalloc. The same goes
for the 'is_data' argument from btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the find_free_extent_ctl
structure so that it's close to other bool fields and reduces the size of
the structure from 144 down to 136 bytes (at the moment it's only declared
in the stack of btrfs_reserve_extent(), never allocated otherwise).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of one 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current read verification is also relying on large folios to support
bs > ps cases, but that introduced quite some limits.
To enhance read-repair to support bs > ps without large folios:
- Make btrfs_data_csum_ok() to accept an array of paddrs
Which can pass the paddrs[] direct into
btrfs_calculate_block_csum_pages().
- Make repair_one_sector() to accept an array of paddrs
So that it can submit a repair bio backed by regular pages, not only
large folios.
This requires us to allocate more slots at bio allocation time though.
Also since the caller may have only partially advanced the saved_iter
for bs > ps cases, we can not directly trust the logical bytenr from
saved_iter (can be unaligned), thus a manual round down is necessary
for the logical bytenr.
- Make btrfs_check_read_bio() to build an array of paddrs
The tricky part is that we can only call btrfs_data_csum_ok() after
all involved pages are assembled.
This means at the call time of btrfs_check_read_bio(), our offset
inside the bio is already at the end of the fs block.
Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
repair_one_sector().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For bs > ps cases, all folios passed into btrfs_csum_one_bio() are
ensured to be backed by large folios. But that requirement excludes
features like direct IO and encoded writes.
To support bs > ps without large folios, enhance btrfs_csum_one_bio()
by:
- Split btrfs_calculate_block_csum() into two versions
* btrfs_calculate_block_csum_folio()
For call sites where a fs block is always backed by a large folio.
This will do extra checks on the folio size, build a paddrs[] array,
and pass it into the newer btrfs_calculate_block_csum_pages()
helper.
For now btrfs_check_block_csum() is still using this version.
* btrfs_calculate_block_csum_pages()
For call sites that may hit a fs block backed by noncontiguous pages.
The pages are represented by paddrs[] array, which includes the
offset inside the page.
This function will do the proper sub-block handling.
- Make btrfs_csum_one_bio() to use btrfs_calculate_block_csum_pages()
This means we will need to build a local paddrs[] array, and after
filling a fs block, do the checksum calculation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the CSUM_FMT* definitions to fs.h where is be the BTRFS_KEY_FMT
and add the prefix for consistency.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We used IRQ version of spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called in end_bbio_data_write()
which was in IRQ context.
However since we're moving all the btrfs_bio::end_io() calls into task
context, there is no more need to support IRQ context thus we can relax
to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.
The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.
However that behavior is really no different than metadata inode, thus
we can reuse btree_inode as btrfs_bio::inode for scrub.
The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.
Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info
from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from
btrfs_bio structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.
Then btrfs will fail to compile.
[CAUSE]
There is a recursive including chain:
"bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
"bio.h"
That recursive including is causing problems for btrfs.
[ENHANCEMENT]
To reduce the risk of recursive including:
- Remove unnecessary local includes from btrfs headers
Either the included header is pulled in by other headers, or is
completely unnecessary.
- Remove btrfs local includes if the header only requires a pointer
In that case let the implementing C file to pull the required header.
This is especially important for headers like "btrfs_inode.h" which
pulls in a lot of other btrfs headers, thus it's a mine field of
recursive including.
- Remove unnecessary temporary structure definition
Either if we have included the header defining the structure, or
completely unused.
Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The free_ipath() function was being used as a cleanup function
everywhere. Declare it via DEFINE_FREE() so we can use this function
with the __free() helper.
The name has also been adjusted so it's closer to the type's name.
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Change all locations that print a key to use the new macros to print
them in order to ensure a consistent style and avoid repetitive code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have the inode locked so no one can concurrently change its i_size and
neither do we change it ourselves, so there's no point in keep rounding
it in the while loop and setting it up in the control structure. That only
causes confusion when reading the code.
So move all the i_size setup and rounding out of the loop and assert the
inode is locked.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're using different ways to round down the i_size by sector size, one
with a bitwise and with a negated mask and another with ALIGN_DOWN(), and
using ALIGN() to round up.
Replace these uses with the round_down() and round_up() macros which have
have names that make it clear the direction of the rounding (unlike the
ALIGN() macro) and getting rid of the bitwise and, negated mask and local
variable for the mask.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A new fs state EMERGENCY_SHUTDOWN is introduced, which is btrfs'
equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN, after entering
emergency shutdown state, all operations will return errors (-EIO), and
can not be bring back to normal state until unmouont.
The new state will reject the following file operations:
- read_iter()
- write_iter()
- mmap()
- open()
- remap_file_range()
- uring_cmd()
- splice_read()
This requires a small wrapper to do the extra shutdown check, then call
the regular filemap_splice_read() function
This should reject most of the file operations on a shutdown btrfs.
And for the existing dirty folios, extra shutdown checks are introduced
to the following functions:
- run_delalloc_nocow()
- run_delalloc_compressed()
- cow_file_range()
So that dirty ranges will still be properly cleaned without being
submitted.
Finally the shutdown state will also set the fs error, so that no new
transaction will be committed, protecting the metadata from any possible
further corruption.
And when the fs entered shutdown mode for the first time, a critical
level kernel message will show up to indicate the incident.
That message will be important for end users as rejected delalloc ranges
will output error messages, hopefully that shutdown message and the fact
that all fs operations are returning error will prevent end users from
getting too confused about the delalloc error messages.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Tested-by: Anand Jain <asj@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When compiling with -Wshadow (also in 'make W=2' build) there are
several reports of shadowed variables that seem to be harmless:
- btrfs_do_encoded_write() - we can reuse 'ordered', there's no previous
value that would need to be preserved
- scrub_write_endio() - we need a standalone 'i' for bio iteration
- scrub_stripe() - duplicate ret2 for errors that must not overwrite 'ret'
- btrfs_subpage_set_writeback() - 'flags' is used for another irqsave lock
but is not overwritten when reused for xarray
due to scoping, but for clarity let's rename it
- process_dir_items_leaf() - duplicate 'ret', used only for immediate checks
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Root filesystem was ext4, btrfs was mounted on /testfs.
Then issuing access(2) in a loop on /testfs/repos/linux/include/linux/fs.h
on Sapphire Rapids (ops/s):
before: 3447976
after: 3620879 (+5%)
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251107142149.989998-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix new inode name tracking in tree-log
- fix conventional zone and stripe calculations in zoned mode
- fix bio reference counts on error paths in relocation and scrub
* tag 'for-6.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: release root after error in data_reloc_print_warning_inode()
btrfs: scrub: put bio after errors in scrub_raid56_parity_stripe()
btrfs: do not update last_log_commit when logging inode due to a new name
btrfs: zoned: fix stripe width calculation
btrfs: zoned: fix conventional zone capacity calculation
|
|
There is no good reason to have this as a func call, other than avoiding
the churn of adding fs_struct.h as needed.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
data_reloc_print_warning_inode() calls btrfs_get_fs_root() to obtain
local_root, but fails to release its reference when paths_from_inode()
returns an error. This causes a potential memory leak.
Add a missing btrfs_put_root() call in the error path to properly
decrease the reference count of local_root.
Fixes: b9a9a85059cde ("btrfs: output affected files when relocation fails")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix memory leak in qgroup relation ioctl when qgroup levels are
invalid
- don't write back dirty metadata on filesystem with errors
- properly log renamed links
- properly mark prealloc extent range beyond inode size as dirty (when
no-noles is not enabled)
* tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: mark dirty extent range for out of bound prealloc extents
btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
btrfs: ensure no dirty metadata is written back for an fs with errors
|
|
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If we are logging a new name make sure our inode has the runtime flag
BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find
new inode refs/extrefs in the subvolume tree and copy them into the log
tree.
We are currently doing it when adding a new link but we are missing it
when renaming.
An example where this makes a new name not persisted:
1) create symlink with name foo in directory A
2) fsync directory A, which persists the symlink
3) rename the symlink from foo to bar
4) fsync directory A to persist the new symlink name
Step 4 isn't working correctly as it's not logging the new name and also
leaving the old inode ref in the log tree, so after a power failure the
symlink still has the old name of "foo". This is because when we first
fsync directoy A we log the symlink's inode (as it's a new entry) and at
btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because
we are using that mode and the inode has the runtime flag
BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag
BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode,
during the rename through the call to btrfs_log_new_name() (calling
btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search
the subvolume tree for new refs/extrefs and jump directory to the
'log_extents' label.
Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode
when we are about to log a new name. A test case for fstests will follow
soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Abstract out the btrfs-specific behavior of kicking off I/O on a number
of pages on an address_space into a well-defined helper.
Note: there is no kerneldoc comment for the new function because it is
not part of the public API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-7-hch@lst.de
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In preparation for changing the filemap_fdatawrite_wbc API to not expose
the writeback_control to the callers, push the wbc declaration next to
the filemap_fdatawrite_wbc call and just pass the nr_to_write value to
start_delalloc_inodes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-6-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
start_delalloc_inodes has a struct inode * pointer available in the
main loop, use it instead of re-calculating it from the btrfs inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-5-hch@lst.de
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Change generated with coccinelle and fixed up by hand as appropriate.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no new features, the changes are in the core code, notably
tree-log error handling and reporting improvements, and initial
support for block size > page size.
Performance improvements:
- search data checksums in the commit root (previous transaction) to
avoid locking contention, this improves parallelism of read
heavy/low write workloads, and also reduces transaction commit
time; on real and reproducer workload the sync time went from
minutes to tens of seconds (workload and numbers are in the
changelog)
Core:
- tree-log updates:
- error handling improvements, transaction aborts
- add new error state 'O' (printed in status messages) when log
replay fails and is aborted
- reduced number of btrfs_path allocations when traversing the
tree
- 'block size > page size' support
- basic implementation with limitations, under experimental build
- limitations: no direct io, raid56, encoded read (standalone and
in send ioctl), encoded write
- preparatory work for compression, removing implicit assumptions
of page and block sizes
- compression workspaces are now per-filesystem, we cannot assume
common block size for work memory among different filesystems
- tree-checker now verifies INODE_EXTREF item (which is implementing
hardlinks)
- tree leaf pretty printer updates, there were missing data from
items, keys/items
- move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
it's a debugging feature and not needed to be enabled separately
- more struct btrfs_path auto free updates
- use ref_tracker API for tracking delayed inodes, enabled by mount
option 'ref_verify', allowing to better pinpoint leaking references
- in zoned mode, avoid selecting data relocation zoned for ordinary
data block groups
- updated and enhanced error messages
- lots of cleanups and refactoring"
* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
btrfs: add unlikely annotations to branches leading to transaction abort
btrfs: add unlikely annotations to branches leading to EIO
btrfs: add unlikely annotations to branches leading to EUCLEAN
btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
btrfs: zoned: don't fail mount needlessly due to too many active zones
btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
btrfs: enable experimental bs > ps support
btrfs: add extra ASSERT()s to catch unaligned bios
btrfs: fix symbolic link reading when bs > ps
btrfs: prepare scrub to support bs > ps cases
btrfs: prepare zlib to support bs > ps cases
btrfs: prepare lzo to support bs > ps cases
btrfs: prepare zstd to support bs > ps cases
btrfs: prepare compression folio alloc/free for bs > ps cases
btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
btrfs: remove pointless key offset setup in create_pending_snapshot()
btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
btrfs: make the rule checking more readable for should_cow_block()
btrfs: simplify inline extent end calculation at replay_one_extent()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:
# mkfs.btrfs -s 8k $dev
# mount $dev $mnt
# mkdir $mnt/dir
# ln -s dir $mnt/link
# ls $mnt/link
The call trace looks like this:
BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): enabling ssd optimizations
BTRFS info (device dm-2): enabling free space tree
------------[ cut here ]------------
kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
Call Trace:
<TASK>
btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
filemap_read_folio+0x37/0xe0
do_read_cache_folio+0x94/0x3e0
__page_get_link.isra.0+0x20/0x90
page_get_link+0x16/0x40
step_into+0x69b/0x830
path_lookupat+0xa7/0x170
filename_lookup+0xf7/0x200
? set_ptes.isra.0+0x36/0x70
vfs_statx+0x7a/0x160
do_statx+0x63/0xa0
__x64_sys_statx+0x90/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.
[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.
But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.
Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.
[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().
For symbolic links this will fix the crash as now the folio will meet
the minimal order.
For regular files this brings no change.
For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.
Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we want to iterate a bio in block unit, we do something
like this:
while (iter->bi_size) {
struct bio_vec bv = bio_iter_iovec();
/* Do something with using the bv */
bio_advance_iter_single(&bbio->bio, iter, sectorsize);
}
That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.
This means the code using that bv can only handle a block if bs <= ps.
To address this problem and handle future bs > ps cases better:
- Introduce a helper btrfs_bio_for_each_block()
Instead of bio_vec, which has single and multiple page version and
multiple page version has quite some limits, use my favorite way to
represent a block, phys_addr_t.
For bs <= ps cases, nothing is changed, except we will do a very
small overhead to convert phys_addr_t to a folio, then use the proper
folio helpers to handle the possible highmem cases.
For bs > ps cases, all blocks will be backed by large folios, meaning
every folio will cover at least one block. And still use proper folio
helpers to handle highmem cases.
With phys_addr_t, we will handle both large folio and highmem
properly. So there is no better single variable to present a btrfs
block than phys_addr_t.
- Extract the data block csum calculation into a helper
The new helper, btrfs_calculate_block_csum() will be utilized by
btrfs_csum_one_bio().
- Use btrfs_bio_for_each_block() to replace existing call sites
Including:
* index_one_bio() from raid56.c
Very straight-forward.
* btrfs_check_read_bio()
Also update repair_one_sector() to grab the folio using phys_addr_t,
and do extra checks to make sure the folio covers at least one
block.
We do not need to bother bv_len at all now.
* btrfs_csum_one_bio()
Now we can move the highmem handling into a dedicated helper,
calculate_block_csum(), and use btrfs_bio_for_each_block() helper.
There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently for btrfs checksum verification, we do it in the following
pattern:
kaddr = kmap_local_*();
ret = btrfs_check_csum_csum(kaddr);
kunmap_local(kaddr);
It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.
In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.
Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:
- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
To follow the more common term "block" used in all other major
filesystems.
- Pass a physical address into btrfs_check_block_csum() and
btrfs_data_csum_ok()
The physical address is always available even for a highmem page.
Since it's page frame number << PAGE_SHIFT + offset in page.
And with that physical address, we can grab the folio covering the
page, and do extra checks to ensure it covers at least one block.
This also allows us to do the kmap inside btrfs_check_block_csum().
This means all the extra HIGHMEM handling will be concentrated into
btrfs_check_block_csum(), and no callers will need to bother highmem
by themselves.
- Properly zero out the block if csum mismatch
Since btrfs_data_csum_ok() only got a paddr, we can not and should not
use memzero_bvec(), which only accepts single page bvec.
Instead use paddr to grab the folio and call folio_zero_range()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a very low chance that DEBUG_WARN() inside
btrfs_writepage_cow_fixup() can be triggered when
CONFIG_BTRFS_EXPERIMENTAL is enabled.
This only happens after run_delalloc_nocow() failed.
Unfortunately I haven't hit it for a while thus no real world dmesg for
now.
[CAUSE]
There is a race window where after run_delalloc_nocow() failed, error
handling can race with writeback thread.
Before we hit run_delalloc_nocow(), there is an inode with the following
dirty pages: (4K page size, 4K block size, no large folio)
0 4K 8K 12K 16K
|/////////|///////////|///////////|////////////|
The inode also have NODATACOW flag, and the above dirty range will go
through different extents during run_delalloc_range():
0 4K 8K 12K 16K
| NOCOW | COW | COW | NOCOW |
The race happen like this:
writeback thread A | writeback thread B
----------------------------------+--------------------------------------
Writeback for folio 0 |
run_delalloc_nocow() |
|- nocow_one_range() |
| For range [0, 4K), ret = 0 |
| |
|- fallback_to_cow() |
| For range [4K, 8K), ret = 0 |
| Folio 4K *UNLOCKED* |
| | Writeback for folio 4K
|- fallback_to_cow() | extent_writepage()
| For range [8K, 12K), failure | |- writepage_delalloc()
| | |
|- btrfs_cleanup_ordered_extents()| |
|- btrfs_folio_clear_ordered() | |
| Folio 0 still locked, safe | |
| | | Ordered extent already allocated.
| | | Nothing to do.
| | |- extent_writepage_io()
| | |- btrfs_writepage_cow_fixup()
|- btrfs_folio_clear_ordered() | |
Folio 4K hold by thread B, | |
UNSAFE! | |- btrfs_test_ordered()
| | Cleared by thread A,
| |
| |- DEBUG_WARN();
This is only possible after run_delalloc_nocow() failure, as
cow_file_range() will keep all folios and io tree range locked, until
everything is finished or after error handling.
The root cause is we allow fallback_to_cow() and nocow_one_range() to
unlock the folios after a successful run, so that during error handling
we're no longer safe to use btrfs_cleanup_ordered_extents() as the
folios are already unlocked.
[FIX]
- Make fallback_to_cow() and nocow_one_range() to keep folios locked
after a successful run
For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag
into cow_file_range().
For nocow_one_range() we have to remove the PAGE_UNLOCK flag from
extent_clear_unlock_delalloc().
- Unlock folios if everything is fine in run_delalloc_nocow()
- Use extent_clear_unlock_delalloc() to handle range [@start,
@cur_offset) inside run_delalloc_nocow()
Since folios are still locked, we do not need
cleanup_dirty_folios() to do the cleanup.
extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK |
PAGE_END_WRITEBACK" will clear the dirty flags.
- Remove cleanup_dirty_folios()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we hit an error inside nocow_one_range(), we do not clear
the page dirty, and let the caller to handle it.
This is very different compared to fallback_to_cow(), when that function
failed, everything will be cleaned up by cow_file_range().
Enhance the situation by:
- Use a common error handling for nocow_one_range()
If we failed anything, use the same btrfs_cleanup_ordered_extents()
and extent_clear_unlock_delalloc().
btrfs_cleanup_ordered_extents() is safe even if we haven't created new
ordered extent, in that case there should be no OE and that function
will do nothing.
The same applies to extent_clear_unlock_delalloc(), and since we're
passing PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK, it
will also clear folio dirty flag during error handling.
- Avoid touching the failed range of nocow_one_range()
As the failed range will be cleaned up and unlocked by that function.
Here we introduce a new variable @nocow_end to record the failed range,
so that we can skip it during the error handling of run_delalloc_nocow().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|