summaryrefslogtreecommitdiff
path: root/fs/nfsd/filecache.c
AgeCommit message (Collapse)Author
2025-11-12nfsd: allow filecache to hold S_IFDIR filesJeff Layton
The filecache infrastructure will only handle S_IFREG files at the moment. Directory delegations will require adding support for opening S_IFDIR inodes. Plumb a "type" argument into nfsd_file_do_acquire() and have all of the existing callers set it to S_IFREG. Add a new nfsd_file_acquire_dir() wrapper that nfsd can call to request a nfsd_file that holds a directory open. For now, there is no need for a fsnotify_mark for directories, as CB_NOTIFY is not yet supported. Change nfsd_file_do_acquire() to avoid allocating one for non-S_IFREG inodes. Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-14-52f3feebb2f2@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-06Merge tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "Mike Snitzer has prototyped a mechanism for disabling I/O caching in NFSD. This is introduced in v6.18 as an experimental feature. This enables scaling NFSD in /both/ directions: - NFS service can be supported on systems with small memory footprints, such as low-cost cloud instances - Large NFS workloads will be less likely to force the eviction of server-local activity, helping it avoid thrashing Jeff Layton contributed a number of fixes to the new attribute delegation implementation (based on a pending Internet RFC) that we hope will make attribute delegation reliable enough to enable by default, as it is on the Linux NFS client. The remaining patches in this pull request are clean-ups and minor optimizations. Many thanks to the contributors, reviewers, testers, and bug reporters who participated during the v6.18 NFSD development cycle" * tag 'nfsd-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits) nfsd: discard nfserr_dropit SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO instead of depending on it NFSD: Add io_cache_{read,write} controls to debugfs NFSD: Do the grace period check in ->proc_layoutget nfsd: delete unnecessary NULL check in __fh_verify() NFSD: Allow layoutcommit during grace period NFSD: Disallow layoutget during grace period sunrpc: fix "occurence"->"occurrence" nfsd: Don't force CRYPTO_LIB_SHA256 to be built-in nfsd: nfserr_jukebox in nlm_fopen should lead to a retry NFSD: Reduce DRC bucket size NFSD: Delay adding new entries to LRU SUNRPC: Move the svc_rpcb_cleanup() call sites NFS: Remove rpcbind cleanup for NFSv4.0 callback nfsd: unregister with rpcbind when deleting a transport NFSD: Drop redundant conversion to bool sunrpc: eliminate return pointer in svc_tcp_sendmsg() sunrpc: fix pr_notice in svc_tcp_sendto() to show correct length nfsd: decouple the xprtsec policy check from check_nfsd_access() NFSD: Fix destination buffer size in nfsd4_ssc_setup_dul() ...
2025-10-03Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "New Features: - Add a Kconfig option to redirect dfprintk() to the trace buffer - Enable use of the RWF_DONTCACHE flag on the NFS client - Add striped layout handling to pNFS flexfiles - Add proper localio handling for READ and WRITE O_DIRECT Bugfixes: - Handle NFS4ERR_GRACE errors during delegation recall - Fix NFSv4.1 backchannel max_resp_sz verification check - Fix mount hang after CREATE_SESSION failure - Fix d_parent->d_inode locking in nfs4_setup_readdir() Other Cleanups and Improvements: - Improvements to write handling tracepoints - Fix a few trivial spelling mistakes - Cleanups to the rpcbind cleanup call sites - Convert the SUNRPC xdr_buf to use a scratch folio instead of scratch page - Remove unused NFS_WBACK_BUSY() macro - Remove __GFP_NOWARN flags - Unexport rpc_malloc() and rpc_free()" * tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits) NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support nfs/localio: add tracepoints for misaligned DIO READ and WRITE support nfs/localio: add proper O_DIRECT support for READ and WRITE nfs/localio: refactor iocb initialization nfs/localio: refactor iocb and iov_iter_bvec initialization nfs/localio: avoid issuing misaligned IO using O_DIRECT nfs/localio: make trace_nfs_local_open_fh more useful NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support sunrpc: unexport rpc_malloc() and rpc_free() NFSv4/flexfiles: Add support for striped layouts NFSv4/flexfiles: Update layout stats & error paths for striped layouts NFSv4/flexfiles: Write path updates for striped layouts NFSv4/flexfiles: Commit path updates for striped layouts NFSv4/flexfiles: Read path updates for striped layouts NFSv4/flexfiles: Update low level helper functions to be DS stripe aware. NFSv4/flexfiles: Add data structure support for striped layouts NFSv4/flexfiles: Use ds_commit_idx when marking a write commit NFSv4/flexfiles: Remove cred local variable dependency nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing NFS: Enable use of the RWF_DONTCACHE flag on the NFS client ...
2025-09-30NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN supportMike Snitzer
Use STATX_DIOALIGN and STATX_DIO_READ_ALIGN to get DIO alignment attributes from the underlying filesystem and store them in the associated nfsd_file. This is done when the nfsd_file is first opened for each regular file. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-09-21nfsd: discard nfsd_file_get_local()NeilBrown
This interface was deprecated by commit e6f7e1487ab5 ("nfs_localio: simplify interface to nfsd for getting nfsd_file") and is now unused. So let's remove it. Signed-off-by: NeilBrown <neil@brown.name> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-09-19fs: replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Adding system_dfl_wq to encourage its use when unbound work should be used. The old system_unbound_wq will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-2-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14NFSD: Clean up kdoc for nfsd_file_put_local()Chuck Lever
Sparse reports that the synopsis of nfsd_file_put_local() does not match its kdoc comment. Introduced by commit c25a89770d1f ("nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer") . Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-28nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointerNeilBrown
Instead of calling xchg() and unrcu_pointer() before nfsd_file_put_local(), we now pass pointer to the __rcu pointer and call xchg() and unrcu_pointer() inside that function. Where unrcu_pointer() is currently called the internals of "struct nfsd_file" are not known and that causes older compilers such as gcc-8 to complain. In some cases we have a __kernel (aka normal) pointer not an __rcu pointer so we need to cast it to __rcu first. This is strictly a weakening so no information is lost. Somewhat surprisingly, this cast is accepted by gcc-8. This has the pleasing result that the cmpxchg() which sets ro_file and rw_file, and also the xchg() which clears them, are both now in the nfsd code. Reported-by: Pali Rohár <pali@kernel.org> Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: always hold nfsd net ref with nfsd_file refNeilBrown
Having separate nfsd_file_put and nfsd_file_put_local in struct nfsd_localio_operations doesn't make much sense. The difference is that nfsd_file_put doesn't drop a reference to the nfs_net which is what keeps nfsd from shutting down. Currently, if nfsd tries to shutdown it will invalidate the files stored in the list from the nfs_uuid and this will drop all references to the nfsd net that the client holds. But the client could still hold some references to nfsd_files for active IO. So nfsd might think is has completely shut down local IO, but hasn't and has no way to wait for those active IO requests to complete. So this patch changes nfsd_file_get to nfsd_file_get_local and has it increase the ref count on the nfsd net and it replaces all calls to ->nfsd_put_file to ->nfsd_put_file_local. It also changes ->nfsd_open_local_fh to return with the refcount on the net elevated precisely when a valid nfsd_file is returned. This means that whenever the client holds a valid nfsd_file, there will be an associated count on the nfsd net, and so the count can only reach zero when all nfsd_files have been returned. nfs_local_file_put() is changed to call nfs_to_nfsd_file_put_local() instead of replacing calls to one with calls to the other because this will help a later patch which changes nfs_to_nfsd_file_put_local() to take an __rcu pointer while nfs_local_file_put() doesn't. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-03-10nfsd: filecache: drop the list_lru lock during lock gc scansNeilBrown
Under a high NFSv3 load with lots of different files being accessed, the LRU list of garbage-collectable files can become quite long. Asking list_lru_scan_node() to scan the whole list can result in a long period during which a spinlock is held, blocking the addition of new LRU items. So ask list_lru_scan_node() to scan only a few entries at a time, and repeat until the scan is complete. If the shrinker runs between two consecutive calls of list_lru_scan_node() it could invalidate the "remaining" counter which could lead to premature freeing. So add a spinlock to avoid that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: don't repeatedly add/remove files on the lru listNeilBrown
There is no need to remove a file from the lru every time we access it, and then add it back. It is sufficient to set the REFERENCED flag every time we put the file. The order in the lru of REFERENCED files is largely irrelevant as they will all be moved to the end. With this patch, files are added only when they are allocated (if want_gc) and they are removed only by the list_lru_(shrink_)walk callback or when forcibly removing a file. This should reduce contention on the list_lru spinlock(s) and reduce memory traffic a little. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: introduce NFSD_FILE_RECENTNeilBrown
The filecache lru is walked in 2 circumstances for 2 different reasons. 1/ When called from the shrinker we want to discard the first few entries on the list, ignoring any with NFSD_FILE_REFERENCED set because they should really be at the end of the LRU as they have been referenced recently. So those ones are ROTATED. 2/ When called from the nfsd_file_gc() timer function we want to discard anything that hasn't been used since before the previous call, and mark everything else as unused at this point in time. Using the same flag for both of these can result in some unexpected outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then nfsd_file_gc() will think the file hasn't been used in a while, while really it has. I think it is easier to reason about the behaviour if we instead have two flags. NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please put it there when convenient" NFSD_FILE_RECENT means "this has been used recently - since the last run of nfsd_file_gc() When either caller finds an NFSD_FILE_REFERENCED entry, that entry should be moved to the end of the LRU and the flag cleared. This can safely happen at any time. The actual order on the lru might not be strictly least-recently-used, but that is normal for linux lrus. The shrinker callback can ignore the "recent" flag. If it ends up freeing something that is "recent" that simply means that memory pressure is sufficient to limit the acceptable cache age to less than the nfsd_file_gc frequency. The gc callback should primarily focus on NFSD_FILE_RECENT. It should free everything that doesn't have this flag set, and should clear the flag on everything else. When it clears the flag it is convenient to clear the "REFERENCED" flag and move to the end of the LRU too. With this, calls from the shrinker do not prematurely age files. It will focus only on freeing those that are least recently used. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()NeilBrown
list_lru_walk() is only useful when the aim is to remove all elements from the list_lru. It will repeatedly visit rotated elements of the first per-node sublist before proceeding to subsequent sublists. This patch changes nfsd_file_gc() to use list_lru_walk_node() and list_lru_count_node() on each NUMA node. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()NeilBrown
nfsd_file_close_inode_sync() contains an exact copy of nfsd_file_dispose_list(). This patch removes that copy and calls nfsd_file_dispose_list() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Re-organize nfsd_file_gc_worker()Chuck Lever
Dave opines: IMO, there is no need to do this unnecessary work on every object that is added to the LRU. Changing the gc worker to always run every 2s and check if it has work to do like so: static void nfsd_file_gc_worker(struct work_struct *work) { - nfsd_file_gc(); - if (list_lru_count(&nfsd_file_lru)) - nfsd_file_schedule_laundrette(); + if (list_lru_count(&nfsd_file_lru)) + nfsd_file_gc(); + nfsd_file_schedule_laundrette(); } means that nfsd_file_gc() will be run the same way and have the same behaviour as the current code. When the system it idle, it does a list_lru_count() check every 2 seconds and goes back to sleep. That's going to be pretty much unnoticable on most machines that run NFS servers. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: remove race handling.NeilBrown
The race that this code tries to protect against is not interesting. The code is problematic as we access the "nf" after we have given our reference to the lru system. While that takes 2+ seconds to free things, it is still poor form. The only interesting race I can find would be with nfsd_file_close_inode_sync(); This is the only place that really doesn't want the file to stay on the LRU when unhashed (which is the direct consequence of the race). However for the race to happen, some other thread must own a reference to a file and be putting it while nfsd_file_close_inode_sync() is trying to close all files for an inode. If this is possible, that other thread could simply call nfsd_file_put() a little bit later and the result would be the same: not all files are closed when nfsd_file_close_inode_sync() completes. If this was really a problem, we would need to wait in close_inode_sync for the other references to be dropped. We probably don't want to do that. So it is best to simply remove this code. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-02-10Merge tag 'nfsd-6.14-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Fixes for new bugs: - A fix for CB_GETATTR reply decoding was not quite correct - Fix the NFSD connection limiting logic - Fix a bug in the new session table resizing logic Bugs that pre-date v6.14: - Support for courteous clients (5.19) introduced a shutdown hang - Fix a crash in the filecache laundrette (6.9) - Fix a zero-day crash in NFSD's NFSv3 ACL implementation" * tag 'nfsd-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix CB_GETATTR status fix NFSD: fix hang in nfsd4_shutdown_callback nfsd: fix __fh_verify for localio nfsd: fix uninitialised slot info when a request is retried nfsd: validate the nfsd_serv pointer before calling svc_wake_up nfsd: clear acl_access/acl_default after releasing them
2025-02-02nfsd: validate the nfsd_serv pointer before calling svc_wake_upJeff Layton
nfsd_file_dispose_list_delayed can be called from the filecache laundrette, which is shut down after the nfsd threads are shut down and the nfsd_serv pointer is cleared. If nn->nfsd_serv is NULL then there are no threads to wake. Ensure that the nn->nfsd_serv pointer is non-NULL before calling svc_wake_up in nfsd_file_dispose_list_delayed. This is safe since the svc_serv is not freed until after the filecache laundrette is cancelled. Reported-by: Salvatore Bonaccorso <carnil@debian.org> Closes: https://bugs.debian.org/1093734 Fixes: ffb402596147 ("nfsd: Don't leave work of closing files to a work queue") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-01-28Merge tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "New Features: - Enable using direct IO with localio - Added localio related tracepoints Bugfixes: - Sunrpc fixes for working with a very large cl_tasks list - Fix a possible buffer overflow in nfs_sysfs_link_rpc_client() - Fixes for handling reconnections with localio - Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT - Fix COPY_NOTIFY xdr_buf size calculations - pNFS/Flexfiles fix for retrying requesting a layout segment for reads - Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is expired Cleanups: - Various other nfs & nfsd localio cleanups - Prepratory patches for async copy improvements that are under development - Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other xprts - Add netns inum and srcaddr to debugfs rpc_xprt info" * tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (28 commits) SUNRPC: do not retry on EKEYEXPIRED when user TGT ticket expired sunrpc: add netns inum and srcaddr to debugfs rpc_xprt info pnfs/flexfiles: retry getting layout segment for reads NFSv4.2: make LAYOUTSTATS and LAYOUTERROR MOVEABLE NFSv4.2: mark OFFLOAD_CANCEL MOVEABLE NFSv4.2: fix COPY_NOTIFY xdr buf size calculation NFS: Rename struct nfs4_offloadcancel_data NFS: Fix typo in OFFLOAD_CANCEL comment NFS: CB_OFFLOAD can return NFS4ERR_DELAY nfs: Make NFS_FSCACHE select NETFS_SUPPORT instead of depending on it nfs: fix incorrect error handling in LOCALIO nfs: probe for LOCALIO when v3 client reconnects to server nfs: probe for LOCALIO when v4 client reconnects to server nfs/localio: remove redundant code and simplify LOCALIO enablement nfs_common: add nfs_localio trace events nfs_common: track all open nfsd_files per LOCALIO nfs_client nfs_common: rename nfslocalio nfs_uuid_lock to nfs_uuids_lock nfsd: nfsd_file_acquire_local no longer returns GC'd nfsd_file nfsd: rename nfsd_serv_ prefixed methods and variables with nfsd_net_ nfsd: update percpu_ref to manage references on nfsd_net ...
2025-01-14nfs_common: track all open nfsd_files per LOCALIO nfs_clientMike Snitzer
This tracking enables __nfsd_file_cache_purge() to call nfs_localio_invalidate_clients(), upon shutdown or export change, to nfs_close_local_fh() all open nfsd_files that are still cached by the LOCALIO nfs clients associated with nfsd_net that is being shutdown. Now that the client must track all open nfsd_files there was more work than necessary being done with the global nfs_uuids_lock contended. This manifested in various RCU issues, e.g.: hrtimer: interrupt took 47969440 ns rcu: INFO: rcu_sched detected stalls on CPUs/tasks: Use nfs_uuid->lock to protect all nfs_uuid_t members, instead of nfs_uuids_lock, once nfs_uuid_is_local() adds the client to nn->local_clients. Also add 'local_clients_lock' to 'struct nfsd_net' to protect nn->local_clients. And store a pointer to spinlock in the 'list_lock' member of nfs_uuid_t so nfs_localio_disable_client() can use it to avoid taking the global nfs_uuids_lock. In combination, these split out locks eliminate the use of the single nfslocalio.c global nfs_uuids_lock in the IO paths (open and close). Also refactored associated fs/nfs_common/nfslocalio.c methods' locking to reduce work performed with spinlocks held in general. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-14nfsd: nfsd_file_acquire_local no longer returns GC'd nfsd_fileMike Snitzer
Now that LOCALIO no longer leans on NFSD's filecache for caching open files (and instead uses NFS client-side open nfsd_file caching) there is no need to use NFSD filecache's GC feature. Avoiding GC will speed up nfsd_file initial opens. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-01-14nfsd: rename nfsd_serv_ prefixed methods and variables with nfsd_net_Mike Snitzer
Also update Documentation/filesystems/nfs/localio.rst accordingly and reduce the technical documentation debt that was previously captured in that document. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-12-02tree-wide: s/revert_creds_light()/revert_creds()/gChristian Brauner
Rename all calls to revert_creds_light() back to revert_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-6-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02tree-wide: s/revert_creds()/put_cred(revert_creds_light())/gChristian Brauner
Convert all calls to revert_creds() over to explicitly dropping reference counts in preparation for converting revert_creds() to revert_creds_light() semantics. Link: https://lore.kernel.org/r/20241125-work-cred-v2-3-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-26Merge tag 'nfsd-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "Jeff Layton contributed a scalability improvement to NFSD's NFSv4 backchannel session implementation. This improvement is intended to increase the rate at which NFSD can safely recall NFSv4 delegations from clients, to avoid the need to revoke them. Revoking requires a slow state recovery process. A wide variety of bug fixes and other incremental improvements make up the bulk of commits in this series. As always I am grateful to the NFSD contributors, reviewers, testers, and bug reporters who participated during this cycle" * tag 'nfsd-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (72 commits) nfsd: allow for up to 32 callback session slots nfs_common: must not hold RCU while calling nfsd_file_put_local nfsd: get rid of include ../internal.h nfsd: fix nfs4_openowner leak when concurrent nfsd4_open occur NFSD: Add nfsd4_copy time-to-live NFSD: Add a laundromat reaper for async copy state NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operations NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOAD NFSD: Free async copy information in nfsd4_cb_offload_release() NFSD: Fix nfsd4_shutdown_copy() NFSD: Add a tracepoint to record canceled async COPY operations nfsd: make nfsd4_session->se_flags a bool nfsd: remove nfsd4_session->se_bchannel nfsd: make use of warning provided by refcount_t nfsd: Don't fail OP_SETCLIENTID when there are too many clients. svcrdma: fix miss destroy percpu_counter in svc_rdma_proc_init() xdrgen: Remove program_stat_to_errno() call sites xdrgen: Update the files included in client-side source code xdrgen: Remove check for "nfs_ok" in C templates xdrgen: Remove tracepoint call site ...
2024-11-18nfs_common: must not hold RCU while calling nfsd_file_put_localMike Snitzer
Move holding the RCU from nfs_to_nfsd_file_put_local to nfs_to_nfsd_net_put. It is the call to nfs_to->nfsd_serv_put that requires the RCU anyway (the puts for nfsd_file and netns were combined to avoid an extra indirect reference but that micro-optimization isn't possible now). This fixes xfstests generic/013 and it triggering: "Voluntary context switch within RCU read-side critical section!" [ 143.545738] Call Trace: [ 143.546206] <TASK> [ 143.546625] ? show_regs+0x6d/0x80 [ 143.547267] ? __warn+0x91/0x140 [ 143.547951] ? rcu_note_context_switch+0x496/0x5d0 [ 143.548856] ? report_bug+0x193/0x1a0 [ 143.549557] ? handle_bug+0x63/0xa0 [ 143.550214] ? exc_invalid_op+0x1d/0x80 [ 143.550938] ? asm_exc_invalid_op+0x1f/0x30 [ 143.551736] ? rcu_note_context_switch+0x496/0x5d0 [ 143.552634] ? wakeup_preempt+0x62/0x70 [ 143.553358] __schedule+0xaa/0x1380 [ 143.554025] ? _raw_spin_unlock_irqrestore+0x12/0x40 [ 143.554958] ? try_to_wake_up+0x1fe/0x6b0 [ 143.555715] ? wake_up_process+0x19/0x20 [ 143.556452] schedule+0x2e/0x120 [ 143.557066] schedule_preempt_disabled+0x19/0x30 [ 143.557933] rwsem_down_read_slowpath+0x24d/0x4a0 [ 143.558818] ? xfs_efi_item_format+0x50/0xc0 [xfs] [ 143.559894] down_read+0x4e/0xb0 [ 143.560519] xlog_cil_commit+0x1b2/0xbc0 [xfs] [ 143.561460] ? _raw_spin_unlock+0x12/0x30 [ 143.562212] ? xfs_inode_item_precommit+0xc7/0x220 [xfs] [ 143.563309] ? xfs_trans_run_precommits+0x69/0xd0 [xfs] [ 143.564394] __xfs_trans_commit+0xb5/0x330 [xfs] [ 143.565367] xfs_trans_roll+0x48/0xc0 [xfs] [ 143.566262] xfs_defer_trans_roll+0x57/0x100 [xfs] [ 143.567278] xfs_defer_finish_noroll+0x27a/0x490 [xfs] [ 143.568342] xfs_defer_finish+0x1a/0x80 [xfs] [ 143.569267] xfs_bunmapi_range+0x4d/0xb0 [xfs] [ 143.570208] xfs_itruncate_extents_flags+0x13d/0x230 [xfs] [ 143.571353] xfs_free_eofblocks+0x12e/0x190 [xfs] [ 143.572359] xfs_file_release+0x12d/0x140 [xfs] [ 143.573324] __fput+0xe8/0x2d0 [ 143.573922] __fput_sync+0x1d/0x30 [ 143.574574] nfsd_filp_close+0x33/0x60 [nfsd] [ 143.575430] nfsd_file_free+0x96/0x150 [nfsd] [ 143.576274] nfsd_file_put+0xf7/0x1a0 [nfsd] [ 143.577104] nfsd_file_put_local+0x18/0x30 [nfsd] [ 143.578070] nfs_close_local_fh+0x101/0x110 [nfs_localio] [ 143.579079] __put_nfs_open_context+0xc9/0x180 [nfs] [ 143.580031] nfs_file_clear_open_context+0x4a/0x60 [nfs] [ 143.581038] nfs_file_release+0x3e/0x60 [nfs] [ 143.581879] __fput+0xe8/0x2d0 [ 143.582464] __fput_sync+0x1d/0x30 [ 143.583108] __x64_sys_close+0x41/0x80 [ 143.583823] x64_sys_call+0x189a/0x20d0 [ 143.584552] do_syscall_64+0x64/0x170 [ 143.585240] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 143.586185] RIP: 0033:0x7f3c5153efd7 Fixes: 65f2a5c36635 ("nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: make use of warning provided by refcount_tNeilBrown
refcount_t, by design, checks for unwanted situations and provides warnings. It is rarely useful to have explicit warnings with refcount usage. In this case we have an explicit warning if a refcount_t reaches zero when decremented. Simply using refcount_dec() will provide a similar warning and also mark the refcount_t as saturated to avoid any possible use-after-free. This patch drops the warning and uses refcount_dec() instead of refcount_dec_and_test(). Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-11mm/list_lru: simplify the list_lru walk callback functionKairui Song
Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11NFSD: Remove unused function parameterChuck Lever
Clean up: Commit 65294c1f2c5e ("nfsd: add a new struct file caching facility to nfsd") moved the fh_verify() call site out of nfsd_open(). That was the only user of nfsd_open's @rqstp parameter, so that parameter can be removed. Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-11Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Localio Bugfixes: - remove duplicated include in localio.c - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT - fix nfsd_file tracepoints to handle NULL rqstp pointers Other Bugfixes: - fix program selection loop in svc_process_common - fix integer overflow in decode_rc_list() - prevent NULL-pointer dereference in nfs42_complete_copies() - fix CB_RECALL performance issues when using a large number of delegations" * tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: remove revoked delegation from server's delegation list nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies() SUNRPC: Fix integer overflow in decode_rc_list() sunrpc: fix prog selection loop in svc_process_common nfs: Remove duplicated include in localio.c
2024-10-10Merge tag 'nfsd-6.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix NFSD bring-up / shutdown - Fix a UAF when releasing a stateid * tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix possible badness in FREE_STATEID nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed NFSD: Mark filecache "down" if init fails
2024-10-03nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()Mike Snitzer
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd module unload. Similarly, use RCU around nfs_open_local_fh()'s error path call to nfs_to->nfsd_serv_put(). Holding RCU ensures that NFS will safely _call and return_ from its nfs_to calls into the NFSD functions nfsd_file_put_local() and nfsd_serv_put(). Otherwise, if RCU isn't used then there is a narrow window when NFS's reference for the nfsd_file and nfsd_serv are dropped and the NFSD module could be unloaded, which could result in a crash from the return instruction for either nfs_to->nfsd_file_put_local() or nfs_to->nfsd_serv_put(). Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-02inotify: Fix possible deadlock in fsnotify_destroy_markLizhi Xu
[Syzbot reported] WARNING: possible circular locking dependency detected 6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted ------------------------------------------------------ kswapd0/78 is trying to acquire lock: ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline] ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578 but task is already holding lock: ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline] ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: ... kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044 inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline] inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline] __do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline] __se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (&group->mark_mutex){+.+.}-{3:3}: ... __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752 fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline] fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578 fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934 fsnotify_inoderemove include/linux/fsnotify.h:264 [inline] dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403 __dentry_kill+0x20d/0x630 fs/dcache.c:610 shrink_kill+0xa9/0x2c0 fs/dcache.c:1055 shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082 prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163 super_cache_scan+0x34f/0x4b0 fs/super.c:221 do_shrink_slab+0x701/0x1160 mm/shrinker.c:435 shrink_slab+0x1093/0x14d0 mm/shrinker.c:662 shrink_one+0x43b/0x850 mm/vmscan.c:4815 shrink_many mm/vmscan.c:4876 [inline] lru_gen_shrink_node mm/vmscan.c:4954 [inline] shrink_node+0x3799/0x3de0 mm/vmscan.c:5934 kswapd_shrink_node mm/vmscan.c:6762 [inline] balance_pgdat mm/vmscan.c:6954 [inline] kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223 [Analysis] The problem is that inotify_new_watch() is using GFP_KERNEL to allocate new watches under group->mark_mutex, however if dentry reclaim races with unlinking of an inode, it can end up dropping the last dentry reference for an unlinked inode resulting in removal of fsnotify mark from reclaim context which wants to acquire group->mark_mutex as well. This scenario shows that all notification groups are in principle prone to this kind of a deadlock (previously, we considered only fanotify and dnotify to be problematic for other reasons) so make sure all allocations under group->mark_mutex happen with GFP_NOFS. Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
2024-09-23nfsd: add LOCALIO supportWeston Andros Adamson
Add server support for bypassing NFS for localhost reads, writes, and commits. This is only useful when both the client and server are running on the same host. If nfsd_open_local_fh() fails then the NFS client will both retry and fallback to normal network-based read, write and commit operations if localio is no longer supported. Care is taken to ensure the same NFS security mechanisms are used (authentication, etc) regardless of whether localio or regular NFS access is used. The auth_domain established as part of the traditional NFS client access to the NFS server is also used for localio. Store auth_domain for localio in nfsd_uuid_t and transfer it to the client if it is local to the server. Relative to containers, localio gives the client access to the network namespace the server has. This is required to allow the client to access the server's per-namespace nfsd_net struct. This commit also introduces the use of NFSD's percpu_ref to interlock nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh and other LOCALIO client code. CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Co-developed-by: NeilBrown <neilb@suse.de> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23nfs_common: prepare for the NFS client to use nfsd_file for LOCALIOMike Snitzer
The next commit will introduce nfsd_open_local_fh() which returns an nfsd_file structure. This commit exposes LOCALIO's required NFSD symbols to the NFS client: - Make nfsd_open_local_fh() symbol and other required NFSD symbols available to NFS in a global 'nfs_to' nfsd_localio_operations struct (global access suggested by Trond, nfsd_localio_operations suggested by NeilBrown). The next commit will also introduce nfsd_localio_ops_init() that init_nfsd() will call to initialize 'nfs_to'. - Introduce nfsd_file_file() that provides access to nfsd_file's backing file. Keeps nfsd_file structure opaque to NFS client (as suggested by Jeff Layton). - Introduce nfsd_file_put_local() that will put the reference to the nfsd_file's associated nn->nfsd_serv and then put the reference to the nfsd_file (as suggested by NeilBrown). Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com> # nfs_to Suggested-by: NeilBrown <neilb@suse.de> # nfsd_localio_operations Suggested-by: Jeff Layton <jlayton@kernel.org> # nfsd_file_file Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23nfsd: add nfsd_file_acquire_local()NeilBrown
nfsd_file_acquire_local() can be used to look up a file by filehandle without having a struct svc_rqst. This can be used by NFS LOCALIO to allow the NFS client to bypass the NFS protocol to directly access a file provided by the NFS server which is running in the same kernel. In nfsd_file_do_acquire() care is taken to always use fh_verify() if rqstp is not NULL (as is the case for non-LOCALIO callers). Otherwise the non-LOCALIO callers will not supply the correct and required arguments to __fh_verify (e.g. gssclient isn't passed). Introduce fh_verify_local() wrapper around __fh_verify to make it clear that LOCALIO is intended caller. Also, use GC for nfsd_file returned by nfsd_file_acquire_local. GC offers performance improvements if/when a file is reopened before launderette cleans it from the filecache's LRU. Suggested-by: Jeff Layton <jlayton@kernel.org> # use filecache's GC Signed-off-by: NeilBrown <neilb@suse.de> Co-developed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-23NFSD: Mark filecache "down" if init failsChuck Lever
NeilBrown says: > The handling of NFSD_FILE_CACHE_UP is strange. nfsd_file_cache_init() > sets it, but doesn't clear it on failure. So if nfsd_file_cache_init() > fails for some reason, nfsd_file_cache_shutdown() would still try to > clean up if it was called. Reported-by: NeilBrown <neilb@suse.de> Fixes: c7b824c3d06c ("NFSD: Replace the "init once" mechanism") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-20nfsd: remove unused parameter of nfsd_file_mark_find_or_createLi Lingfeng
Commit 427f5f83a319 ("NFSD: Ensure nf_inode is never dereferenced") passes inode directly to nfsd_file_mark_find_or_create instead of getting it from nf, so there is no need to pass nf. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01nfsd: use system_unbound_wq for nfsd_file_gc_worker()Youzhong Yang
After many rounds of changes in filecache.c, the fix by commit ce7df055(NFSD: Make the file_delayed_close workqueue UNBOUND) is gone, now we are getting syslog messages like these: [ 1618.186688] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 4 times, consider switching to WQ_UNBOUND [ 1638.661616] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 8 times, consider switching to WQ_UNBOUND [ 1665.284542] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 16 times, consider switching to WQ_UNBOUND [ 1759.491342] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 32 times, consider switching to WQ_UNBOUND [ 3013.012308] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 64 times, consider switching to WQ_UNBOUND [ 3154.172827] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 128 times, consider switching to WQ_UNBOUND [ 3422.461924] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 256 times, consider switching to WQ_UNBOUND [ 3963.152054] workqueue: nfsd_file_gc_worker [nfsd] hogged CPU for >13333us 512 times, consider switching to WQ_UNBOUND Consider use system_unbound_wq instead of system_wq for nfsd_file_gc_worker(). Signed-off-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01nfsd: count nfsd_file allocationsJeff Layton
We already count the frees (via nfsd_file_releases). Count the allocations as well. Also switch the direct call to nfsd_file_slab_free in nfsd_file_do_acquire to nfsd_file_free, so that the allocs and releases match up. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01nfsd: fix refcount leak when file is unhashed after being foundJeff Layton
If we wait_for_construction and find that the file is no longer hashed, and we're going to retry the open, the old nfsd_file reference is currently leaked. Put the reference before retrying. Fixes: c6593366c0bf ("nfsd: don't kill nfsd_files because of lease break error") Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01nfsd: remove unneeded EEXIST error check in nfsd_do_file_acquireJeff Layton
Given that we do the search and insertion while holding the i_lock, I don't think it's possible for us to get EEXIST here. Remove this case. Fixes: c6593366c0bf ("nfsd: don't kill nfsd_files because of lease break error") Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Youzhong Yang <youzhong@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-09-01nfsd: add list_head nf_gc to struct nfsd_fileYouzhong Yang
nfsd_file_put() in one thread can race with another thread doing garbage collection (running nfsd_file_gc() -> list_lru_walk() -> nfsd_file_lru_cb()): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() returns true (with NFSD_FILE_REFERENCED bit set) * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED bit and returns LRU_ROTATE. * garbage collector kicks in again, nfsd_file_lru_cb() now decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from the LRU and adds to the dispose list [list_lru_isolate_move(lru, &nf->nf_lru, head)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]. The 'nf' has been added to the 'dispose' list by nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as part of the LRU and removes it, which leads to its removal from the 'dispose' list. * At this moment, 'nf' is unhashed with its nf_ref being 0, and not on the LRU. nfsd_file_put() continues its execution [if (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, nf->nf_ref is set to REFCOUNT_SATURATED, and the 'nf' gets no chance of being freed. nfsd_file_put() can also race with nfsd_file_cond_queue(): * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add(). * nfsd_file_lru_add() sets REFERENCED bit and returns true. * Some userland application runs 'exportfs -f' or something like that, which triggers __nfsd_file_cache_purge() -> nfsd_file_cond_queue(). * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))], unhash is done successfully. * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now nf->nf_ref goes to 2. * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds. * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement, &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to 0, the 'nf' is added to the dispose list [list_add(&nf->nf_lru, dispose)] * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but it is linked in the 'dispose' list, nfsd_file_lru_remove() simply treats it as part of the LRU and removes it. This leads to its removal from the 'dispose' list! * Now nf->ref is 0, unhashed. nfsd_file_put() continues its execution and set nf->nf_ref to REFCOUNT_SATURATED. As shown in the above analysis, using nf_lru for both the LRU list and dispose list can cause the leaks. This patch adds a new list_head nf_gc in struct nfsd_file, and uses it for the dispose list. This does not fix the nfsd_file leaking issue completely. Signed-off-by: Youzhong Yang <youzhong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-12nfsd: nfsd_file_lease_notifier_call gets a file_lease as an argumentJeff Layton
"data" actually refers to a file_lease and not a file_lock. Both structs have their file_lock_core as the first field though, so this bug should be harmless without struct randomization in play. Reported-by: Florian Evers <florian-evers@gmx.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219008 Fixes: 05580bbfc6bc ("nfsd: adapt to breakup of struct file_lock") Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Florian Evers <florian-evers@gmx.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-04fsnotify: create a wrapper fsnotify_find_inode_mark()Amir Goldstein
In preparation to passing an object pointer to fsnotify_find_mark(), add a wrapper fsnotify_find_inode_mark() and use it where possible. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20240317184154.1200192-4-amir73il@gmail.com>
2024-03-12Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "The bulk of the patches for this release are optimizations, code clean-ups, and minor bug fixes. One new feature to mention is that NFSD administrators now have the ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has had this capability for some time. As always I am grateful to NFSD contributors, reviewers, and testers" * tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits) NFSD: Clean up nfsd4_encode_replay() NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit NFSD: Document nfsd_setattr() fill-attributes behavior nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr() nfsd: Fix a regression in nfsd_setattr() NFSD: OP_CB_RECALL_ANY should recall both read and write delegations NFSD: handle GETATTR conflict with write delegation NFSD: add support for CB_GETATTR callback NFSD: Document the phases of CREATE_SESSION NFSD: Fix the NFSv4.1 CREATE_SESSION operation nfsd: clean up comments over nfs4_client definition svcrdma: Add Write chunk WRs to the RPC's Send WR chain svcrdma: Post WRs for Write chunks in svc_rdma_sendto() svcrdma: Post the Reply chunk and Send WR together svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt svcrdma: Post Send WR chain svcrdma: Fix retry loop in svc_rdma_send() svcrdma: Prevent a UAF in svc_rdma_send() svcrdma: Fix SQ wake-ups svcrdma: Increase the per-transport rw_ctx count ...
2024-03-01nfsd: Simplify the allocation of slab caches in nfsd_file_cache_initKunwu Chan
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: use __fput_sync() to avoid delayed closing of files.NeilBrown
Calling fput() directly or though filp_close() from a kernel thread like nfsd causes the final __fput() (if necessary) to be called from a workqueue. This means that nfsd is not forced to wait for any work to complete. If the ->release or ->destroy_inode function is slow for any reason, this can result in nfsd closing files more quickly than the workqueue can complete the close and the queue of pending closes can grow without bounces (30 million has been seen at one customer site, though this was in part due to a slowness in xfs which has since been fixed). nfsd does not need this. It is quite appropriate and safe for nfsd to do its own close work. There is no reason that close should ever wait for nfsd, so no deadlock can occur. It should be safe and sensible to change all fput() calls to __fput_sync(). However in the interests of caution this patch only changes two - the two that can be most directly affected by client behaviour and could occur at high frequency. - the fput() implicitly in flip_close() is changed to __fput_sync() by calling get_file() first to ensure filp_close() doesn't do the final fput() itself. If is where files opened for IO are closed. - the fput() in nfsd_read() is also changed. This is where directories opened for readdir are closed. This ensure that minimal fput work is queued to the workqueue. This removes the need for the flush_delayed_fput() call in nfsd_file_close_inode_sync() Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-03-01nfsd: Don't leave work of closing files to a work queueNeilBrown
The work of closing a file can have non-trivial cost. Doing it in a separate work queue thread means that cost isn't imposed on the nfsd threads and an imbalance can be created. This can result in files being queued for the work queue more quickly that the work queue can process them, resulting in unbounded growth of the queue and memory exhaustion. To avoid this work imbalance that exhausts memory, this patch moves all closing of files into the nfsd threads. This means that when the work imposes a cost, that cost appears where it would be expected - in the work of the nfsd thread. A subsequent patch will ensure the final __fput() is called in the same (nfsd) thread which calls filp_close(). Files opened for NFSv3 are never explicitly closed by the client and are kept open by the server in the "filecache", which responds to memory pressure, is garbage collected even when there is no pressure, and sometimes closes files when there is particular need such as for rename. These files currently have filp_close() called in a dedicated work queue, so their __fput() can have no effect on nfsd threads. This patch discards the work queue and instead has each nfsd thread call flip_close() on as many as 8 files from the filecache each time it acts on a client request (or finds there are no pending client requests). If there are more to be closed, more threads are woken. This spreads the work of __fput() over multiple threads and imposes any cost on those threads. The number 8 is somewhat arbitrary. It needs to be greater than 1 to ensure that files are closed more quickly than they can be added to the cache. It needs to be small enough to limit the per-request delays that will be imposed on clients when all threads are busy closing files. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-02-05nfsd: adapt to breakup of struct file_lockJeff Layton
Most of the existing APIs have remained the same, but subsystems that access file_lock fields directly need to reach into struct file_lock_core now. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240131-flsplit-v3-42-c6129007ee8d@kernel.org Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>