| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Add mechanism for cleaning out unused, stale dentries; controlled via
a module option (Luis Henriques)
- Fix various bugs
- Cleanups
* tag 'fuse-update-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Uninitialized variable in fuse_epoch_work()
fuse: fix io-uring list corruption for terminated non-committed requests
fuse: signal that a fuse inode should exhibit local fs behaviors
fuse: Always flush the page cache before FOPEN_DIRECT_IO write
fuse: Invalidate the page cache after FOPEN_DIRECT_IO write
fuse: rename 'namelen' to 'namesize'
fuse: use strscpy instead of strcpy
fuse: refactor fuse_conn_put() to remove negative logic.
fuse: new work queue to invalidate dentries from old epochs
fuse: new work queue to periodically invalidate expired dentries
dcache: export shrink_dentry_list() and add new helper d_dispose_if_unused()
fuse: add WARN_ON and comment for RCU revalidate
fuse: Fix whitespace for fuse_uring_args_to_ring() comment
fuse: missing copy_finish in fuse-over-io-uring argument copies
fuse: fix readahead reclaim deadlock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Unify how task_work cancelations are detected, placing it in the
task_work running state rather than needing to check the task state
- Series cleaning up and moving the cancelation code to where it
belongs, in cancel.c
- Cleanup of waitid and futex argument handling
- Add support for mixed sized SQEs. 6.18 added support for mixed sized
CQEs, improving flexibility and efficiency of workloads that need big
CQEs. This adds similar support for SQEs, where the occasional need
for a 128b SQE doesn't necessitate having all SQEs be 128b in size
- Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
features are available. And both return the ring size information to
help with allocation size calculation for user provided rings like
IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER
- Zcrx updates for 6.19. It includes a bunch of small patches,
IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
zcrx b/w multiple io_uring instances
- Series cleaning up ring initializations, notable deduplicating ring
size and offset calculations. It also moves most of the checking
before doing any allocations, making the code simpler
- Add support for getsockname and getpeername, which is mostly a
trivial hookup after a bit of refactoring on the networking side
- Various fixes and cleanups
* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring: Introduce getsockname io_uring cmd
socket: Split out a getsockname helper for io_uring
socket: Unify getsockname and getpeername implementation
io_uring/query: drop unused io_handle_query_entry() ctx arg
io_uring/kbuf: remove obsolete buf_nr_pages and update comments
io_uring/register: use correct location for io_rings_layout
io_uring/zcrx: share an ifq between rings
io_uring/zcrx: add io_fill_zcrx_offsets()
io_uring/zcrx: export zcrx via a file
io_uring/zcrx: move io_zcrx_scrub() and dependencies up
io_uring/zcrx: count zcrx users
io_uring/zcrx: add sync refill queue flushing
io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
io_uring/zcrx: elide passing msg flags
io_uring/zcrx: use folio_nr_pages() instead of shift operation
io_uring/zcrx: convert to use netmem_desc
io_uring/query: introduce rings info query
io_uring/query: introduce zcrx query
io_uring: move cq/sq user offset init around
io_uring: pre-calculate scq layout
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory locking updates from Christian Brauner:
"This contains the work to add centralized APIs for directory locking
operations.
This series is part of a larger effort to change directory operation
locking to allow multiple concurrent operations in a directory. The
ultimate goal is to lock the target dentry(s) rather than the whole
parent directory.
To help with changing the locking protocol, this series centralizes
locking and lookup in new helper functions. The helpers establish a
pattern where it is the dentry that is being locked and unlocked
(currently the lock is held on dentry->d_parent->d_inode, but that can
change in the future).
This also changes vfs_mkdir() to unlock the parent on failure, as well
as dput()ing the dentry. This allows end_creating() to only require
the target dentry (which may be IS_ERR() after vfs_mkdir()), not the
parent"
* tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nfsd: fix end_creating() conversion
VFS: introduce end_creating_keep()
VFS: change vfs_mkdir() to unlock on failure.
ecryptfs: use new start_creating/start_removing APIs
Add start_renaming_two_dentries()
VFS/ovl/smb: introduce start_renaming_dentry()
VFS/nfsd/ovl: introduce start_renaming() and end_renaming()
VFS: add start_creating_killable() and start_removing_killable()
VFS: introduce start_removing_dentry()
smb/server: use end_removing_noperm for for target of smb2_create_link()
VFS: introduce start_creating_noperm() and start_removing_noperm()
VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing()
VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating()
VFS: tidy up do_unlinkat()
VFS: introduce start_dirop() and end_dirop()
debugfs: rename end_creating() to debugfs_end_creating()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull directory delegations update from Christian Brauner:
"This contains the work for recall-only directory delegations for
knfsd.
Add support for simple, recallable-only directory delegations. This
was decided at the fall NFS Bakeathon where the NFS client and server
maintainers discussed how to merge directory delegation support.
The approach starts with recallable-only delegations for several reasons:
1. RFC8881 has gaps that are being addressed in RFC8881bis. In
particular, it requires directory position information for
CB_NOTIFY callbacks, which is difficult to implement properly
under Linux. The spec is being extended to allow that information
to be omitted.
2. Client-side support for CB_NOTIFY still lags. The client side
involves heuristics about when to request a delegation.
3. Early indication shows simple, recallable-only delegations can
help performance. Anna Schumaker mentioned seeing a multi-minute
speedup in xfstests runs with them enabled.
With these changes, userspace can also request a read lease on a
directory that will be recalled on conflicting accesses. This may be
useful for applications like Samba. Users can disable leases
altogether via the fs.leases-enable sysctl if needed.
VFS changes:
- Dedicated Type for Delegations
Introduce struct delegated_inode to track inodes that may have
delegations that need to be broken. This replaces the previous
approach of passing raw inode pointers through the delegation
breaking code paths, providing better type safety and clearer
semantics for the delegation machinery.
- Break parent directory delegations in open(..., O_CREAT) codepath
- Allow mkdir to wait for delegation break on parent
- Allow rmdir to wait for delegation break on parent
- Add try_break_deleg calls for parents to vfs_link(), vfs_rename(),
and vfs_unlink()
- Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations
on parent directory
- Clean up argument list for vfs_create()
- Expose delegation support to userland
Filelock changes:
- Make lease_alloc() take a flags argument
- Rework the __break_lease API to use flags
- Add struct delegated_inode
- Push the S_ISREG check down to ->setlease handlers
- Lift the ban on directory leases in generic_setlease
NFSD changes:
- Allow filecache to hold S_IFDIR files
- Allow DELEGRETURN on directories
- Wire up GET_DIR_DELEGATION handling
Fixes:
- Fix kernel-doc warnings in __fcntl_getlease
- Add needed headers for new struct delegation definition"
* tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: add needed headers for new struct delegation definition
filelock: __fcntl_getlease: fix kernel-doc warnings
vfs: expose delegation support to userland
nfsd: wire up GET_DIR_DELEGATION handling
nfsd: allow DELEGRETURN on directories
nfsd: allow filecache to hold S_IFDIR files
filelock: lift the ban on directory leases in generic_setlease
vfs: make vfs_symlink break delegations on parent dir
vfs: make vfs_mknod break delegations on parent directory
vfs: make vfs_create break delegations on parent directory
vfs: clean up argument list for vfs_create()
vfs: break parent dir delegations in open(..., O_CREAT) codepath
vfs: allow rmdir to wait for delegation break on parent
vfs: allow mkdir to wait for delegation break on parent
vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink}
filelock: push the S_ISREG check down to ->setlease handlers
filelock: add struct delegated_inode
filelock: rework the __break_lease API to use flags
filelock: make lease_alloc() take a flags argument
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
|
|
The fuse_ilookup() function only sets *fm on the success path so this
"if (fm) {" NULL check doesn't work. The "fm" pointer is either
uninitialized or valid. Check the "inode" pointer instead.
Also, while it's not necessary, it is cleaner to move the iput(inode)
under the NULL check as well.
Fixes: 64becd224ff9 ("fuse: new work queue to invalidate dentries from old epochs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When a request is terminated before it has been committed, the request
is not removed from the queue's list. This leaves a dangling list entry
that leads to list corruption and use-after-free issues.
Remove the request from the queue's list for terminated non-committed
requests.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Create a new fuse inode flag that indicates that the kernel should
implement various local filesystem behaviors instead of passing vfs
commands straight through to the fuse server and expecting the server to
do all the work. For example, this means that we'll use the kernel to
transform some ACL updates into mode changes, and later to do
enforcement of the immutable and append iflags.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
objects are created in fuse_ctl_add_dentry() by d_alloc_name()+d_add(),
removed by simple_remove_by_name().
What we return is a borrowed reference - it is valid until the call of
fuse_ctl_remove_conn() and we depend upon the exclusion (on fuse_mutex)
for safety. Return value is used only within the caller
(fuse_ctl_add_conn()).
Replace d_add() with d_make_persistent() + dput(). dput() is paired
with d_alloc_name() and return value is the result of d_make_persistent().
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
simple_recursive_removal(), but instead of victim dentry it takes
parent + name.
Used to be open-coded in fs/fuse/control.c, but there's no need to expose
the guts of that thing there and there are other potential users, so
let's lift it into libfs...
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
fuse_ctl_remove_conn() used to decrement the link count of root
manually; that got subsumed by simple_recursive_removal(), but
in case when subdirectory creation has failed the latter won't
get called.
Just move the modification of parent's link count into
fuse_ctl_add_dentry() to keep the things simple. Allows to
get rid of the nlink argument as well...
Fixes: fcaac5b42768 "fuse_ctl: use simple_recursive_removal()"
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
xfs, fuse, ipc/mqueue need variants of start_creating or start_removing
which do not check permissions.
This patch adds _noperm versions of these functions.
Note that do_mq_open() was only calling mntget() so it could call
path_put() - it didn't really need an extra reference on the mnt.
Now it doesn't call mntget() and uses end_creating() which does
the dput() half of path_put().
Also mq_unlink() previously passed
d_inode(dentry->d_parent)
as the dir inode to vfs_unlink(). This is after locking
d_inode(mnt->mnt_root)
These two inodes are the same, but normally calls use the textual
parent.
So I've changes the vfs_unlink() call to be given d_inode(mnt->mnt_root).
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
--
changes since v2:
- dir arg passed to vfs_unlink() in mq_unlink() changed to match
the dir passed to lookup_noperm()
- restore assignment to path->mnt even though the mntget() is removed.
Link: https://patch.msgid.link/20251113002050.676694-7-neilb@ownmail.net
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This was done as condition on direct_io_allow_mmap, but I believe
this is not right, as a file might be open two times - once with
write-back enabled another time with FOPEN_DIRECT_IO.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
generic_file_direct_write() also does this and has a large
comment about.
Reproducer here is xfstest's generic/209, which is exactly to
have competing DIO write and cached IO read.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
By "length of a string" usually the number of non-null chars is
meant (i.e. strlen(str)). So the variable 'namelen' was confusingly
named, whereas 'namesize' refers more to what's being done in
'get_security_context'.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
As pointed out in [1], strcpy() is deprecated in favor of strscpy().
Furthermore, the size of the buffer for the name to be copied is well known
at this point since we are going to move the pointer by that much on the
next line. Hence, it's safe to assume 'namelen' for the size of the string
to be copied.
[1] https://github.com/KSPP/linux/issues/88
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There is no functional change with this patch. It simply refactors
function fuse_conn_put() to not use negative logic, which makes it more
easier to read.
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
With the infrastructure introduced to periodically invalidate expired
dentries, it is now possible to add an extra work queue to invalidate
dentries when an epoch is incremented. This work queue will only be
triggered when the 'inval_wq' parameter is set.
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This patch adds the necessary infrastructure to keep track of all dentries
created for FUSE file systems. A set of rbtrees, protected by hashed
locks, will be used to keep all these dentries sorted by expiry time.
A new module parameter 'inval_wq' is also added. When set, it will start
a work queue which will periodically invalidate expired dentries. The
value of this new parameter is the period, in seconds, for this work
queue. Once this parameter is set, every new dentry will be added to one
of the rbtrees.
When the work queue is executed, it will check all the rbtrees and will
invalidate those dentries that have timed-out.
The work queue period can not be smaller than 5 seconds, but can be
disabled by setting 'inval_wq' to zero (which is the default).
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The function comment accidentally got wrong indentation.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Fix a possible reference count leak of payload pages during
fuse argument copies.
[Joanne: simplified error cleanup]
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org # v6.14
Signed-off-by: Cheng Ding <cding@ddn.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Instead of requiring that the caller calls iomap_finish_folio_read()
even if the ->read_folio_range() callback returns an error, account for
this internally in iomap instead, which makes the interface simpler and
makes it match writeback's ->read_folio_range() error handling
expectations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pending writebacks must be accounted for to determine when all requests
have completed and writeback on the folio should be ended. Currently
this is done by atomically incrementing ifs->write_bytes_pending for
every range to be written back.
Instead, the number of atomic operations can be minimized by setting
ifs->write_bytes_pending to the folio size, internally tracking how many
bytes are written back asynchronously, and then after sending off all
the requests, decrementing ifs->write_bytes_pending by the number of
bytes not written back asynchronously. Now, for N ranges written back,
only N + 2 atomic operations are required instead of 2N + 2.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When nfsd starts requesting directory delegations, setlease handlers may
see requests for leases on directories. Push the !S_ISREG check down
into the non-trivial setlease handlers, so we can selectively enable
them where they're supported.
FUSE is special: It's the only filesystem that supports atomic_open and
allows kernel-internal leases. atomic_open is issued when the VFS
doesn't know the state of the dentry being opened. If the file doesn't
exist, it may be created, in which case the dir lease should be broken.
The existing kernel-internal lease implementation has no provision for
this. Ensure that we don't allow directory leases by default going
forward by explicitly disabling them there.
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-4-52f3feebb2f2@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Commit e26ee4efbc79 ("fuse: allocate ff->release_args only if release is
needed") skips allocating ff->release_args if the server does not
implement open. However in doing so, fuse_prepare_release() now skips
grabbing the reference on the inode, which makes it possible for an
inode to be evicted from the dcache while there are inflight readahead
requests. This causes a deadlock if the server triggers reclaim while
servicing the readahead request and reclaim attempts to evict the inode
of the file being read ahead. Since the folio is locked during
readahead, when reclaim evicts the fuse inode and fuse_evict_inode()
attempts to remove all folios associated with the inode from the page
cache (truncate_inode_pages_range()), reclaim will block forever waiting
for the lock since readahead cannot relinquish the lock because it is
itself blocked in reclaim:
>>> stack_trace(1504735)
folio_wait_bit_common (mm/filemap.c:1308:4)
folio_lock (./include/linux/pagemap.h:1052:3)
truncate_inode_pages_range (mm/truncate.c:336:10)
fuse_evict_inode (fs/fuse/inode.c:161:2)
evict (fs/inode.c:704:3)
dentry_unlink_inode (fs/dcache.c:412:3)
__dentry_kill (fs/dcache.c:615:3)
shrink_kill (fs/dcache.c:1060:12)
shrink_dentry_list (fs/dcache.c:1087:3)
prune_dcache_sb (fs/dcache.c:1168:2)
super_cache_scan (fs/super.c:221:10)
do_shrink_slab (mm/shrinker.c:435:9)
shrink_slab (mm/shrinker.c:626:10)
shrink_node (mm/vmscan.c:5951:2)
shrink_zones (mm/vmscan.c:6195:3)
do_try_to_free_pages (mm/vmscan.c:6257:3)
do_swap_page (mm/memory.c:4136:11)
handle_pte_fault (mm/memory.c:5562:10)
handle_mm_fault (mm/memory.c:5870:9)
do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
handle_page_fault (arch/x86/mm/fault.c:1481:3)
exc_page_fault (arch/x86/mm/fault.c:1539:2)
asm_exc_page_fault+0x22/0x27
Fix this deadlock by allocating ff->release_args and grabbing the
reference on the inode when preparing the file for release even if the
server does not implement open. The inode reference will be dropped when
the last reference on the fuse file is dropped (see fuse_file_put() ->
fuse_release_end()).
Fixes: e26ee4efbc79 ("fuse: allocate ff->release_args only if release is needed")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
In virtio_fs_add_queues_sysfs(), the code incorrectly checks fs->mqs_kobj
after calling kobject_create_and_add(). Change the check to fsvq->kobj
(fs->mqs_kobj -> fsvq->kobj) to ensure the per-queue kobject is
successfully created.
Fixes: 87cbdc396a31 ("virtio_fs: add sysfs entries for queue information")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20251027104658.1668537-1-alok.a.tiwari@oracle.com
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that fuse is integrated with iomap for read/readahead, we can remove
the workaround that was added in commit bd24d2108e9c ("fuse: fix fuseblk
i_blkbits for iomap partial writes"), which was previously needed to
avoid a race condition where an iomap partial write may be overwritten
by a read if blocksize < PAGE_SIZE. Now that fuse does iomap
read/readahead, this is protected against since there is granular
uptodate tracking of blocks, which means this workaround can be removed.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Do readahead in fuse using iomap. This gives us granular uptodate
tracking for large folios, which optimizes how much data needs to be
read in. If some portions of the folio are already uptodate (eg through
a prior write), we only need to read in the non-uptodate portions.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Read folio data into the page cache using iomap. This gives us granular
uptodate tracking for large folios, which optimizes how much data needs
to be read in. If some portions of the folio are already uptodate (eg
through a prior write), we only need to read in the non-uptodate
portions.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.
The script:
@@
expression inode, flags;
@@
- inode->i_state & flags
+ inode_state_read(inode) & flags
@@
expression inode, flags;
@@
- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)
@@
expression inode, flag1, flag2;
@@
- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)
@@
expression inode, flags;
@@
- inode->i_state |= flags
+ inode_state_set(inode, flags)
@@
expression inode, flags;
@@
- inode->i_state = flags
+ inode_state_assign(inode, flags)
@@
expression inode, flags;
@@
- flags = inode->i_state
+ flags = inode_state_read(inode)
@@
expression inode, flags;
@@
- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This reverts commit 474b155adf3927d2c944423045757b54aa1ca4de.
This patch caused regression in ioctl_setflags(). Underlying filesystems
use EOPNOTSUPP to indicate that flag is not supported. This error is
also gets converted in ioctl_setflags(). Therefore, for unsupported
flags error changed from EOPNOSUPP to ENOIOCTLCMD.
Link: https://lore.kernel.org/linux-xfs/a622643f-1585-40b0-9441-cf7ece176e83@kernel.org/
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Extend copy_file_range interface to be fully 64bit capable (Miklos)
- Add selftest for fusectl (Chen Linxuan)
- Move fuse docs into a separate directory (Bagas Sanjaya)
- Allow fuse to enter freezable state in some cases (Sergey
Senozhatsky)
- Clean up writeback accounting after removing tmp page copies (Joanne)
- Optimize virtiofs request handling (Li RongQing)
- Add synchronous FUSE_INIT support (Miklos)
- Allow server to request prune of unused inodes (Miklos)
- Fix deadlock with AIO/sync release (Darrick)
- Add some prep patches for block/iomap support (Darrick)
- Misc fixes and cleanups
* tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits)
fuse: move CREATE_TRACE_POINTS to a separate file
fuse: move the backing file idr and code into a new source file
fuse: enable FUSE_SYNCFS for all fuseblk servers
fuse: capture the unique id of fuse commands being sent
fuse: fix livelock in synchronous file put from fuseblk workers
mm: fix lockdep issues in writeback handling
fuse: add prune notification
fuse: remove redundant calls to fuse_copy_finish() in fuse_notify()
fuse: fix possibly missing fuse_copy_finish() call in fuse_notify()
fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h>
fuse: remove fuse_readpages_end() null mapping check
fuse: fix references to fuse.rst -> fuse/fuse.rst
fuse: allow synchronous FUSE_INIT
fuse: zero initialize inode private data
fuse: remove unused 'inode' parameter in fuse_passthrough_open
virtio_fs: fix the hash table using in virtio_fs_enqueue_req()
mm: remove BDI_CAP_WRITEBACK_ACCT
fuse: use default writeback accounting
virtio_fs: Remove redundant spinlock in virtio_fs_request_complete()
fuse: remove unneeded offset assignment when filling write pages
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull finish_no_open updates from Al Viro:
"finish_no_open calling conventions change to simplify callers"
* tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
slightly simplify nfs_atomic_open()
simplify gfs2_atomic_open()
simplify fuse_atomic_open()
simplify nfs_atomic_open_v23()
simplify vboxsf_dir_atomic_open()
simplify cifs_atomic_open()
9p: simplify v9fs_vfs_atomic_open_dotl()
9p: simplify v9fs_vfs_atomic_open()
allow finish_no_open(file, ERR_PTR(-E...))
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring updates from Jens Axboe:
- Store ring provided buffers locally for the users, rather than stuff
them into struct io_kiocb.
These types of buffers must always be fully consumed or recycled in
the current context, and leaving them in struct io_kiocb is hence not
a good ideas as that struct has a vastly different life time.
Basically just an architecture cleanup that can help prevent issues
with ring provided buffers in the future.
- Support for mixed CQE sizes in the same ring.
Before this change, a CQ ring either used the default 16b CQEs, or it
was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
a few 32b CQEs were needed, this caused everything else to use big
CQEs. This is wasteful both in terms of memory usage, but also memory
bandwidth for the posted CQEs.
With IORING_SETUP_CQE_MIXED, applications may use request types that
post both normal 16b and big 32b CQEs on the same ring.
- Add helpers for async data management, to make it harder for opcode
handlers to mess it up.
- Add support for multishot for uring_cmd, which ublk can use. This
helps improve efficiency, by providing a persistent request type that
can trigger multiple CQEs.
- Add initial support for ring feature querying.
We had basic support for probe operations, but the API isn't great.
Rather than expand that, add support for QUERY which is easily
expandable and can cover a lot more cases than the existing probe
support. This will help applications get a better idea of what
operations are supported on a given host.
- zcrx improvements from Pavel:
- Improve refill entry alignment for better caching
- Various cleanups, especially around deduplicating normal
memory vs dmabuf setup.
- Generalisation of the niov size (Patch 12). It's still hard
coded to PAGE_SIZE on init, but will let the user to specify
the rx buffer length on setup.
- Syscall / synchronous bufer return. It'll be used as a slow
fallback path for returning buffers when the refill queue is
full. Useful for tolerating slight queue size misconfiguration
or with inconsistent load.
- Accounting more memory to cgroups.
- Additional independent cleanups that will also be useful for
mutli-area support.
- Various fixes and cleanups
* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
io_uring: fix nvme's 32b cqes on mixed cq
io_uring/query: cap number of queries
io_uring/query: prevent infinite loops
io_uring/zcrx: account niov arrays to cgroup
io_uring/zcrx: allow synchronous buffer return
io_uring/zcrx: introduce io_parse_rqe()
io_uring/zcrx: don't adjust free cache space
io_uring/zcrx: use guards for the refill lock
io_uring/zcrx: reduce netmem scope in refill
io_uring/zcrx: protect netdev with pp_lock
io_uring/zcrx: rename dma lock
io_uring/zcrx: make niov size variable
io_uring/zcrx: set sgt for umem area
io_uring/zcrx: remove dmabuf_offset
io_uring/zcrx: deduplicate area mapping
io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
io_uring/zcrx: check all niovs filled with dma addresses
io_uring/zcrx: move area reg checks into io_import_area
io_uring/zcrx: don't pass slot to io_zcrx_create_area
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs workqueue updates from Christian Brauner:
"This contains various workqueue changes affecting the filesystem
layer.
Currently if a user enqueue a work item using schedule_delayed_work()
the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies
to schedule_work() that is using system_wq and queue_work(), that
makes use again of WORK_CPU_UNBOUND.
This replaces the use of system_wq and system_unbound_wq. system_wq is
a per-CPU workqueue which isn't very obvious from the name and
system_unbound_wq is to be used when locality is not required.
So this renames system_wq to system_percpu_wq, and system_unbound_wq
to system_dfl_wq.
This also adds a new WQ_PERCPU flag to allow the fs subsystem users to
explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and
WQ_PERCPU flags coexist for one release cycle to allow callers to
transition their calls. WQ_UNBOUND will be removed in a next release
cycle"
* tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: WQ_PERCPU added to alloc_workqueue users
fs: replace use of system_wq with system_percpu_wq
fs: replace use of system_unbound_wq with system_dfl_wq
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
Before we start adding new tracepoints for fuse+iomap, move the
tracepoint creation itself to a separate source file so that we don't
have to start pulling iomap dependencies into dev.c just for the iomap
structures.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
iomap support for fuse is also going to want the ability to attach
backing files to a fuse filesystem. Move the fuse_backing code into a
separate file so that both can use it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Turn on syncfs for all fuseblk servers so that the ones in the know can
flush cached intermediate data and logs to disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The fuse_request_{send,end} tracepoints capture the value of
req->in.h.unique in the trace output. It would be really nice if we
could use this to match a request to its response for debugging and
latency analysis, but the call to trace_fuse_request_send occurs before
the unique id has been set:
fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
fuse_request_end: connection 8388608 req 6 len 16 error -2
(Notice that req moves from 0 to 6)
Move the callsites to trace_fuse_request_send to after the unique id has
been set by introducing a helper to do that for standard fuse_req
requests. FUSE_FORGET requests are not covered by this because they
appear to be synthesized into the event stream without a fuse_req
object and are never replied to.
Requests that are aborted without ever having been submitted to the fuse
server retain the behavior that only the fuse_request_end tracepoint
shows up in the trace record, and with req==0.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
I observed a hang when running generic/323 against a fuseblk server.
This test opens a file, initiates a lot of AIO writes to that file
descriptor, and closes the file descriptor before the writes complete.
Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for
responses from the fuseblk server:
# cat /proc/372265/task/372313/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_do_getattr+0xfc/0x1f0 [fuse]
[<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse]
[<0>] aio_read+0x130/0x1e0
[<0>] io_submit_one+0x542/0x860
[<0>] __x64_sys_io_submit+0x98/0x1a0
[<0>] do_syscall_64+0x37/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
But the /weird/ part is that the fuseblk server threads are waiting for
responses from itself:
# cat /proc/372210/task/372232/stack
[<0>] request_wait_answer+0x1fe/0x2a0 [fuse]
[<0>] __fuse_simple_request+0xd3/0x2b0 [fuse]
[<0>] fuse_file_put+0x9a/0xd0 [fuse]
[<0>] fuse_release+0x36/0x50 [fuse]
[<0>] __fput+0xec/0x2b0
[<0>] task_work_run+0x55/0x90
[<0>] syscall_exit_to_user_mode+0xe9/0x100
[<0>] do_syscall_64+0x43/0xf0
[<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
The fuseblk server is fuse2fs so there's nothing all that exciting in
the server itself. So why is the fuse server calling fuse_file_put?
The commit message for the fstest sheds some light on that:
"By closing the file descriptor before calling io_destroy, you pretty
much guarantee that the last put on the ioctx will be done in interrupt
context (during I/O completion).
Aha. AIO fgets a new struct file from the fd when it queues the ioctx.
The completion of the FUSE_WRITE command from userspace causes the fuse
server to call the AIO completion function. The completion puts the
struct file, queuing a delayed fput to the fuse server task. When the
fuse server task returns to userspace, it has to run the delayed fput,
which in the case of a fuseblk server, it does synchronously.
Sending the FUSE_RELEASE command sychronously from fuse server threads
is a bad idea because a client program can initiate enough simultaneous
AIOs such that all the fuse server threads end up in delayed_fput, and
now there aren't any threads left to handle the queued fuse commands.
Fix this by only using asynchronous fputs when closing files, and leave
a comment explaining why.
Cc: stable@vger.kernel.org # v2.6.38
Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_wq is a per-CPU worqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users.
Make it clear by adding a system_percpu_wq to all the fs subsystem.
The old wq will be kept for a few release cylces.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/20250916082906.77439-3-marco.crivellari@suse.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Patch series "Add and use memdesc_flags_t".
At some point struct page will be separated from struct slab and struct
folio. This is a step towards that by introducing a type for the 'flags'
word of all three structures. This gives us a certain amount of type
safety by establishing that some of these unsigned longs are different
from other unsigned longs in that they contain things like node ID,
section number and zone number in the upper bits. That lets us have
functions that can be easily called by anyone who has a slab, folio or
page (but not easily by anyone else) to get the node or zone.
There's going to be some unusual merge problems with this as some odd bits
of the kernel decide they want to print out the flags value or something
similar by writing page->flags and now they'll need to write page->flags.f
instead. That's most of the churn here. Maybe we should be removing
these things from the debug output?
This patch (of 11):
Wrap the unsigned long flags in a typedef. In upcoming patches, this will
provide a strong hint that you can't just pass a random unsigned long to
functions which take this as an argument.
[willy@infradead.org: s/flags/flags.f/ in several architectures]
Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org
[nicola.vetrini@gmail.com: mips: fix compilation error]
Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com
Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|