| Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Add mechanism for cleaning out unused, stale dentries; controlled via
a module option (Luis Henriques)
- Fix various bugs
- Cleanups
* tag 'fuse-update-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Uninitialized variable in fuse_epoch_work()
fuse: fix io-uring list corruption for terminated non-committed requests
fuse: signal that a fuse inode should exhibit local fs behaviors
fuse: Always flush the page cache before FOPEN_DIRECT_IO write
fuse: Invalidate the page cache after FOPEN_DIRECT_IO write
fuse: rename 'namelen' to 'namesize'
fuse: use strscpy instead of strcpy
fuse: refactor fuse_conn_put() to remove negative logic.
fuse: new work queue to invalidate dentries from old epochs
fuse: new work queue to periodically invalidate expired dentries
dcache: export shrink_dentry_list() and add new helper d_dispose_if_unused()
fuse: add WARN_ON and comment for RCU revalidate
fuse: Fix whitespace for fuse_uring_args_to_ring() comment
fuse: missing copy_finish in fuse-over-io-uring argument copies
fuse: fix readahead reclaim deadlock
|
|
Create a new fuse inode flag that indicates that the kernel should
implement various local filesystem behaviors instead of passing vfs
commands straight through to the fuse server and expecting the server to
do all the work. For example, this means that we'll use the kernel to
transform some ACL updates into mode changes, and later to do
enforcement of the immutable and append iflags.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
With the infrastructure introduced to periodically invalidate expired
dentries, it is now possible to add an extra work queue to invalidate
dentries when an epoch is incremented. This work queue will only be
triggered when the 'inval_wq' parameter is set.
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This patch adds the necessary infrastructure to keep track of all dentries
created for FUSE file systems. A set of rbtrees, protected by hashed
locks, will be used to keep all these dentries sorted by expiry time.
A new module parameter 'inval_wq' is also added. When set, it will start
a work queue which will periodically invalidate expired dentries. The
value of this new parameter is the period, in seconds, for this work
queue. Once this parameter is set, every new dentry will be added to one
of the rbtrees.
When the work queue is executed, it will check all the rbtrees and will
invalidate those dentries that have timed-out.
The work queue period can not be smaller than 5 seconds, but can be
disabled by setting 'inval_wq' to zero (which is the default).
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Now that fuse is integrated with iomap for read/readahead, we can remove
the workaround that was added in commit bd24d2108e9c ("fuse: fix fuseblk
i_blkbits for iomap partial writes"), which was previously needed to
avoid a race condition where an iomap partial write may be overwritten
by a read if blocksize < PAGE_SIZE. Now that fuse does iomap
read/readahead, this is protected against since there is granular
uptodate tracking of blocks, which means this workaround can be removed.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: syzbot@syzkaller.appspotmail.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
iomap support for fuse is also going to want the ability to attach
backing files to a fuse filesystem. Move the fuse_backing code into a
separate file so that both can use it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The fuse_request_{send,end} tracepoints capture the value of
req->in.h.unique in the trace output. It would be really nice if we
could use this to match a request to its response for debugging and
latency analysis, but the call to trace_fuse_request_send occurs before
the unique id has been set:
fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107
fuse_request_end: connection 8388608 req 6 len 16 error -2
(Notice that req moves from 0 to 6)
Move the callsites to trace_fuse_request_send to after the unique id has
been set by introducing a helper to do that for standard fuse_req
requests. FUSE_FORGET requests are not covered by this because they
appear to be synthesized into the event stream without a fuse_req
object and are never replied to.
Requests that are aborted without ever having been submitted to the fuse
server retain the behavior that only the fuse_request_end tracepoint
shows up in the trace record, and with req==0.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Some fuse servers need to prune their caches, which can only be done if the
kernel's own dentry/inode caches are pruned first to avoid dangling
references.
Add FUSE_NOTIFY_PRUNE, which takes an array of node ID's to try and get rid
of. Inodes with active references are skipped.
A similar functionality is already provided by FUSE_NOTIFY_INVAL_ENTRY with
the FUSE_EXPIRE_ONLY flag. Differences in the interface are
FUSE_NOTIFY_INVAL_ENTRY:
- can only prune one dentry
- dentry is determined by parent ID and name
- if inode has multiple aliases (cached hard links), then they would have
to be invalidated individually to be able to get rid of the inode
FUSE_NOTIFY_PRUNE:
- can prune multiple inodes
- inodes determined by their node ID
- aliases are taken care of automatically
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE_INIT has always been asynchronous with mount. That means that the
server processed this request after the mount syscall returned.
This means that FUSE_INIT can't supply the root inode's ID, hence it
currently has a hardcoded value. There are other limitations such as not
being able to perform getxattr during mount, which is needed by selinux.
To remove these limitations allow server to process FUSE_INIT while
initializing the in-core super block for the fuse filesystem. This can
only be done if the server is prepared to handle this, so add
FUSE_DEV_IOC_SYNC_INIT ioctl, which
a) lets the server know whether this feature is supported, returning
ENOTTY othewrwise.
b) lets the kernel know to perform a synchronous initialization
The implementation is slightly tricky, since fuse_dev/fuse_conn are set up
only during super block creation. This is solved by setting the private
data of the fuse device file to a special value ((struct fuse_dev *) 1) and
waiting for this to be turned into a proper fuse_dev before commecing with
operations on the device file.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The 'inode' parameter in fuse_passthrough_open() is never referenced
in the function implementation.
Signed-off-by: Chunsheng Luo <luochunsheng@ustc.edu>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The FUSE protocol uses struct fuse_write_out to convey the return value of
copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE
interface supports a 64-bit size copies and there's no reason why copies
should be limited to 32-bit.
Introduce a new op COPY_FILE_RANGE_64, which is identical, except the
number of bytes copied is returned in a 64-bit value.
If the fuse server does not support COPY_FILE_RANGE_64, fall back to
COPY_FILE_RANGE.
Reported-by: Florian Weimer <fweimer@redhat.com>
Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT which means
any iomap partial writes will mark the entire folio as uptodate. However
fuseblk filesystems work differently and allow the blocksize to be less
than the page size. As such, this may lead to data corruption if fuseblk
sets its blocksize to less than the page size, uses the writeback cache,
and does a partial write, then a read and the read happens before the
write has undergone writeback, since the folio will not be marked
uptodate from the partial write so the read will read in the entire
folio from disk, which will overwrite the partial write.
The long-term solution for this, which will also be needed for fuse to
enable large folios with the writeback cache on, is to have fuse also
use iomap for folio reads, but until that is done, the cleanest
workaround is to use the page size for fuseblk's internal kernel inode
blksize/blkbits values while maintaining current behavior for stat().
This was verified using ntfs-3g:
$ sudo mkfs.ntfs -f -c 512 /dev/vdd1
$ sudo ntfs-3g /dev/vdd1 ~/fuseblk
$ stat ~/fuseblk/hi.txt
IO Block: 512
Fixes: a4c9ab1d4975 ("fuse: use iomap for buffered writes")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
As pointed out by Miklos[1], in the fuse_update_get_attr() path, the
attributes returned to stat may be cached values instead of fresh ones
fetched from the server. In the case where the server returned a
modified blocksize value, we need to cache it and reflect it back to
stat if values are not re-fetched since we now no longer directly change
inode->i_blkbits.
Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguCOxeVX88_zPd1hqziB_C+tmfuDhZP5qO2nKmnb-dTUA@mail.gmail.com/ [1]
Fixes: 542ede096e48 ("fuse: keep inode->i_blkbits constant")
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fileattr updates from Christian Brauner:
"This introduces the new file_getattr() and file_setattr() system calls
after lengthy discussions.
Both system calls serve as successors and extensible companions to
the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
started to show their age in addition to being named in a way that
makes it easy to conflate them with extended attribute related
operations.
These syscalls allow userspace to set filesystem inode attributes on
special files. One of the usage examples is the XFS quota projects.
XFS has project quotas which could be attached to a directory. All new
inodes in these directories inherit project ID set on parent
directory.
The project is created from userspace by opening and calling
FS_IOC_FSSETXATTR on each inode. This is not possible for special
files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
with empty project ID. Those inodes then are not shown in the quota
accounting but still exist in the directory. This is not critical but
in the case when special files are created in the directory with
already existing project quota, these new inodes inherit extended
attributes. This creates a mix of special files with and without
attributes. Moreover, special files with attributes don't have a
possibility to become clear or change the attributes. This, in turn,
prevents userspace from re-creating quota project on these existing
files.
In addition, these new system calls allow the implementation of
additional attributes that we couldn't or didn't want to fit into the
legacy ioctls anymore"
* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: tighten a sanity check in file_attr_to_fileattr()
tree-wide: s/struct fileattr/struct file_kattr/g
fs: introduce file_getattr and file_setattr syscalls
fs: prepare for extending file_get/setattr()
fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
selinux: implement inode_file_[g|s]etattr hooks
lsm: introduce new hooks for setting/getting inode fsxattr
fs: split fileattr related helpers into separate file
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull simple_recursive_removal() update from Al Viro:
"Removing subtrees of kernel filesystems is done in quite a few places;
unfortunately, it's easy to get wrong. A number of open-coded attempts
are out there, with varying amount of bogosities.
simple_recursive_removal() had been introduced for doing that with all
precautions needed; it does an equivalent of rm -rf, with sufficient
locking, eviction of anything mounted on top of the subtree, etc.
This series converts a bunch of open-coded instances to using that"
* tag 'pull-simple_recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
functionfs, gadgetfs: use simple_recursive_removal()
kill binderfs_remove_file()
fuse_ctl: use simple_recursive_removal()
pstore: switch to locked_recursive_removal()
binfmt_misc: switch to locked_recursive_removal()
spufs: switch to locked_recursive_removal()
add locked_recursive_removal()
better lockdep annotations for simple_recursive_removal()
simple_recursive_removal(): saner interaction with fsnotify
|
|
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.
Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
easier that way - no need to keep that array of dentry references, etc.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
->d_revalidate() is never called for root anyway...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
In the current FUSE writeback design (see commit 3be5a52b30aa
("fuse: support writable mmap")), a temp page is allocated for every
dirty page to be written back, the contents of the dirty page are copied
over to the temp page, and the temp page gets handed to the server to
write back.
This is done so that writeback may be immediately cleared on the dirty
page, and this in turn is done in order to mitigate the following
deadlock scenario that may arise if reclaim waits on writeback on the
dirty page to complete:
* single-threaded FUSE server is in the middle of handling a request
that needs a memory allocation
* memory allocation triggers direct reclaim
* direct reclaim waits on a folio under writeback
* the FUSE server can't write back the folio since it's stuck in
direct reclaim
With a recent change that added AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM and
mitigates the situation described above, FUSE writeback does not need
to use temp pages if it sets AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on its
inode mappings.
This commit sets AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM on the inode
mappings and removes the temporary pages + extra copying and the internal
rb tree.
fio benchmarks --
(using averages observed from 10 runs, throwing away outliers)
Setup:
sudo mount -t tmpfs -o size=30G tmpfs ~/tmp_mount
./libfuse/build/example/passthrough_ll -o writeback -o max_threads=4 -o source=~/tmp_mount ~/fuse_mount
fio --name=writeback --ioengine=sync --rw=write --bs={1k,4k,1M} --size=2G
--numjobs=2 --ramp_time=30 --group_reporting=1 --directory=/root/fuse_mount
bs = 1k 4k 1M
Before 351 MiB/s 1818 MiB/s 1851 MiB/s
After 341 MiB/s 2246 MiB/s 2685 MiB/s
% diff -3% 23% 45%
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Currently userspace is able to notify the kernel to invalidate the cache
for an inode. This means that, if all the inodes in a filesystem need to
be invalidated, then userspace needs to iterate through all of them and do
this kernel notification separately.
This patch adds the concept of 'epoch': each fuse connection will have the
current epoch initialized and every new dentry will have it's d_time set to
the current epoch value. A new operation will then allow userspace to
increment the epoch value. Every time a dentry is d_revalidate()'ed, it's
epoch is compared with the current connection epoch and invalidated if it's
value is different.
Signed-off-by: Luis Henriques <luis@igalia.com>
Tested-by: Laura Promberger <laura.promberger@cern.ch>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Prefer 'unsigned int' to bare 'unsigned', as reported by checkpatch.pl:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'.
Signed-off-by: Jiale Yang <295107659@qq.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Our file system has a translation capability for S3-to-posix.
The current value of 1kiB is enough to cover S3 keys, but
does not allow encoding of %xx escape characters.
The limit is increased to (PATH_MAX - 1), as we need
3 x 1024 and that is close to PATH_MAX (4kB) already.
-1 is used as the terminating null is not included in the
length calculation.
Testing large file names was hard with libfuse/example file systems,
so I created a new memfs that does not have a 255 file name length
limitation.
https://github.com/libfuse/libfuse/pull/1077
The connection is initialized with FUSE_NAME_LOW_MAX, which
is set to the previous value of FUSE_NAME_MAX of 1024. With
FUSE_MIN_READ_BUFFER of 8192 that is enough for two file names
+ fuse headers.
When FUSE_INIT reply sets max_pages to a value > 1 we know
that fuse daemon supports request buffers of at least 2 pages
(+ header) and can therefore hold 2 x PATH_MAX file names - operations
like rename or link that need two file names are no issue then.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Introduce two new sysctls, "default_request_timeout" and
"max_request_timeout". These control how long (in seconds) a server can
take to reply to a request. If the server does not reply by the timeout,
then the connection will be aborted. The upper bound on these sysctl
values is 65535.
"default_request_timeout" sets the default timeout if no timeout is
specified by the fuse server on mount. 0 (default) indicates no default
timeout should be enforced. If the server did specify a timeout, then
default_request_timeout will be ignored.
"max_request_timeout" sets the max amount of time the server may take to
reply to a request. 0 (default) indicates no maximum timeout. If
max_request_timeout is set and the fuse server attempts to set a
timeout greater than max_request_timeout, the system will use
max_request_timeout as the timeout. Similarly, if default_request_timeout
is greater than max_request_timeout, the system will use
max_request_timeout as the timeout. If the server does not request a
timeout and default_request_timeout is set to 0 but max_request_timeout
is set, then the timeout will be max_request_timeout.
Please note that these timeouts are not 100% precise. The request may
take roughly an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the set max
timeout due to how it's internally implemented.
$ sysctl -a | grep fuse.default_request_timeout
fs.fuse.default_request_timeout = 0
$ echo 65536 | sudo tee /proc/sys/fs/fuse/default_request_timeout
tee: /proc/sys/fs/fuse/default_request_timeout: Invalid argument
$ echo 65535 | sudo tee /proc/sys/fs/fuse/default_request_timeout
65535
$ sysctl -a | grep fuse.default_request_timeout
fs.fuse.default_request_timeout = 65535
$ echo 0 | sudo tee /proc/sys/fs/fuse/default_request_timeout
0
$ sysctl -a | grep fuse.default_request_timeout
fs.fuse.default_request_timeout = 0
[Luis Henriques: Limit the timeout to the range [FUSE_TIMEOUT_TIMER_FREQ,
fuse_max_req_timeout]]
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Luis Henriques <luis@igalia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There are situations where fuse servers can become unresponsive or
stuck, for example if the server is deadlocked. Currently, there's no
good way to detect if a server is stuck and needs to be killed manually.
This commit adds an option for enforcing a timeout (in seconds) for
requests where if the timeout elapses without the server responding to
the request, the connection will be automatically aborted.
Please note that these timeouts are not 100% precise. For example, the
request may take roughly an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond
the requested timeout due to internal implementation, in order to
mitigate overhead.
[SzM: Bump the API version number]
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If filesystem doesn't support FUSE_LINK (i.e. returns -ENOSYS), then
remember this and next time return immediately, without incurring the
overhead of a round trip to the server.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
task-A (application) might be in request_wait_answer and
try to remove the request when it has FR_PENDING set.
task-B (a fuse-server io-uring task) might handle this
request with FUSE_IO_URING_CMD_COMMIT_AND_FETCH, when
fetching the next request and accessed the req from
the pending list in fuse_uring_ent_assign_req().
That code path was not protected by fiq->lock and so
might race with task-A.
For scaling reasons we better don't use fiq->lock, but
add a handler to remove canceled requests from the queue.
This also removes usage of fiq->lock from
fuse_uring_add_req_to_ring_ent() altogether, as it was
there just to protect against this race and incomplete.
Also added is a comment why FR_PENDING is not cleared.
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: <stable@vger.kernel.org> # v6.14
Reported-by: Joanne Koong <joannelkoong@gmail.com>
Closes: https://lore.kernel.org/all/CAJnrk1ZgHNb78dz-yfNTpxmW7wtT88A=m-zF0ZoLXKLUHRjNTw@mail.gmail.com/
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Avoid races and block request allocation until io-uring
queues are ready.
This is a especially important for background requests,
as bg request completion might cause lock order inversion
of the typical queue->lock and then fc->bg_lock
fuse_request_end
spin_lock(&fc->bg_lock);
flush_bg_queue
fuse_send_one
fuse_uring_queue_fuse_req
spin_lock(&queue->lock);
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This adds support for fuse request completion through ring SQEs
(FUSE_URING_CMD_COMMIT_AND_FETCH handling). After committing
the ring entry it becomes available for new fuse requests.
Handling of requests through the ring (SQE/CQE handling)
is complete now.
Fuse request data are copied through the mmaped ring buffer,
there is no support for any zero copy yet.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse-over-io-uring uses existing functions to find requests based
on their unique id - make these functions non-static.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This adds basic support for ring SQEs (with opcode=IORING_OP_URING_CMD).
For now only FUSE_IO_URING_CMD_REGISTER is handled to register queue
entries.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This change sets up FUSE operations to always have headers in
args.in_args[0], even for opcodes without an actual header.
This step prepares for a clean separation of payload from headers,
initially it is used by fuse-over-io-uring.
For opcodes without a header, we use a zero-sized struct as a
placeholder. This approach:
- Keeps things consistent across all FUSE operations
- Will help with payload alignment later
- Avoids future issues when header sizes change
Op codes that already have an op code specific header do not
need modification.
Op codes that have neither payload nor op code headers
are not modified either (FUSE_READLINK and FUSE_DESTROY).
FUSE_BATCH_FORGET already has the header in the right place,
but is not using fuse_copy_args - as -over-uring is currently
not handling forgets it does not matter for now, but header
separation will later need special attention for that op code.
Correct the struct fuse_args->in_args array max size.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Function fuse_direntplus_link() might call fuse_iget() to initialize a new
fuse_inode and change its attributes. If fi->attr_version is always
initialized with 0, even if the attributes returned by the FUSE_READDIR
request is staled, as the new fi->attr_version is 0, fuse_change_attributes
will still set the staled attributes to inode. This wrong behaviour may
cause file size inconsistency even when there is no changes from
server-side.
To reproduce the issue, consider the following 2 programs (A and B) are
running concurrently,
A B
---------------------------------- --------------------------------
{ /fusemnt/dir/f is a file path in a fuse mount, the size of f is 0. }
readdir(/fusemnt/dir) start
//Daemon set size 0 to f direntry
fallocate(f, 1024)
stat(f) // B see size 1024
echo 2 > /proc/sys/vm/drop_caches
readdir(/fusemnt/dir) reply to kernel
Kernel set 0 to the I_NEW inode
stat(f) // B see size 0
In the above case, only program B is modifying the file size, however, B
observes file size changing between the 2 'readonly' stat() calls. To fix
this issue, we should make sure readdirplus still follows the rule of
attr_version staleness checking even if the fi->attr_version is lost due to
inode eviction.
To identify this situation, the new fc->evict_ctr is used to record whether
the eviction of inodes occurs during the readdirplus request processing.
If it does, the result of readdirplus may be inaccurate; otherwise, the
result of readdirplus can be trusted. Although this may still lead to
incorrect invalidation, considering the relatively low frequency of
evict occurrences, it should be acceptable.
Link: https://lore.kernel.org/lkml/20230711043405.66256-2-zhangjiachen.jaycee@bytedance.com/
Link: https://lore.kernel.org/lkml/20241114070905.48901-1-zhangtianci.1997@bytedance.com/
Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
All fuse requests use folios instead of pages for transferring data.
Remove pages from the requests and exclusively use folios.
No functional changes.
[SzM: rename back folio_descs -> descs, etc.]
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Convert direct io requests to use folios instead of pages.
No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Convert ioctl requests to use folios instead of pages.
No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Convert non-writeback write requests to use folios instead of pages.
No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Convert read requests to use folios instead of pages.
No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This adds support in struct fuse_args_pages and fuse_copy_pages() for
using folios instead of pages for transferring data. Both folios and
pages must be supported right now in struct fuse_args_pages and
fuse_copy_pages() until all request types have been converted to use
folios. Once all have been converted, then
struct fuse_args_pages and fuse_copy_pages() will only support folios.
Right now in fuse, all folios are one page (large folios are not yet
supported). As such, copying folio->page is sufficient for copying
the entire folio in fuse_copy_pages().
No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When trying to insert a 10MB kernel module kept in a virtio-fs with cache
disabled, the following warning was reported:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
Modules linked in:
CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
RIP: 0010:__alloc_pages+0x2bf/0x380
......
Call Trace:
<TASK>
? __warn+0x8e/0x150
? __alloc_pages+0x2bf/0x380
__kmalloc_large_node+0x86/0x160
__kmalloc+0x33c/0x480
virtio_fs_enqueue_req+0x240/0x6d0
virtio_fs_wake_pending_and_unlock+0x7f/0x190
queue_request_and_unlock+0x55/0x60
fuse_simple_request+0x152/0x2b0
fuse_direct_io+0x5d2/0x8c0
fuse_file_read_iter+0x121/0x160
__kernel_read+0x151/0x2d0
kernel_read+0x45/0x50
kernel_read_file+0x1a9/0x2a0
init_module_from_file+0x6a/0xe0
idempotent_init_module+0x175/0x230
__x64_sys_finit_module+0x5d/0xb0
x64_sys_call+0x1c3/0x9e0
do_syscall_64+0x3d/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
......
</TASK>
---[ end trace 0000000000000000 ]---
The warning is triggered as follows:
1) syscall finit_module() handles the module insertion and it invokes
kernel_read_file() to read the content of the module first.
2) kernel_read_file() allocates a 10MB buffer by using vmalloc() and
passes it to kernel_read(). kernel_read() constructs a kvec iter by
using iov_iter_kvec() and passes it to fuse_file_read_iter().
3) virtio-fs disables the cache, so fuse_file_read_iter() invokes
fuse_direct_io(). As for now, the maximal read size for kvec iter is
only limited by fc->max_read. For virtio-fs, max_read is UINT_MAX, so
fuse_direct_io() doesn't split the 10MB buffer. It saves the address and
the size of the 10MB-sized buffer in out_args[0] of a fuse request and
passes the fuse request to virtio_fs_wake_pending_and_unlock().
4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to
queue the request. Because virtiofs need DMA-able address, so
virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for
all fuse args, copies these args into the bounce buffer and passed the
physical address of the bounce buffer to virtiofsd. The total length of
these fuse args for the passed fuse request is about 10MB, so
copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter and
it triggers the warning in __alloc_pages():
if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
return NULL;
5) virtio_fs_enqueue_req() will retry the memory allocation in a
kworker, but it won't help, because kmalloc() will always return NULL
due to the abnormal size and finit_module() will hang forever.
A feasible solution is to limit the value of max_read for virtio-fs, so
the length passed to kmalloc() will be limited. However it will affect
the maximal read size for normal read. And for virtio-fs write initiated
from kernel, it has the similar problem but now there is no way to limit
fc->max_write in kernel.
So instead of limiting both the values of max_read and max_write in
kernel, introducing use_pages_for_kvec_io in fuse_conn and setting it as
true in virtiofs. When use_pages_for_kvec_io is enabled, fuse will use
pages instead of pointer to pass the KVEC_IO data.
After switching to pages for KVEC_IO data, these pages will be used for
DMA through virtio-fs. If these pages are backed by vmalloc(),
{flush|invalidate}_kernel_vmap_range() are necessary to flush or
invalidate the cache before the DMA operation. So add two new fields in
fuse_args_pages to record the base address of vmalloc area and the
condition indicating whether invalidation is needed. Perform the flush
in fuse_get_user_pages() for write operations and the invalidation in
fuse_release_user_pages() for read operations.
It may seem necessary to introduce another field in fuse_conn to
indicate that these KVEC_IO pages are used for DMA, However, considering
that virtio-fs is currently the only user of use_pages_for_kvec_io, just
reuse use_pages_for_kvec_io to indicate that these pages will be used
for DMA.
Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Tested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Introduce the capability to dynamically configure the max pages limit
(FUSE_MAX_MAX_PAGES) through a sysctl. This allows system administrators
to dynamically set the maximum number of pages that can be used for
servicing requests in fuse.
Previously, this is gated by FUSE_MAX_MAX_PAGES which is statically set
to 256 pages. One result of this is that the buffer size for a write
request is limited to 1 MiB on a 4k-page system.
The default value for this sysctl is the original limit (256 pages).
$ sysctl -a | grep max_pages_limit
fs.fuse.max_pages_limit = 256
$ sysctl -n fs.fuse.max_pages_limit
256
$ echo 1024 | sudo tee /proc/sys/fs/fuse/max_pages_limit
1024
$ sysctl -n fs.fuse.max_pages_limit
1024
$ echo 65536 | sudo tee /proc/sys/fs/fuse/max_pages_limit
tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument
$ echo 0 | sudo tee /proc/sys/fs/fuse/max_pages_limit
tee: /proc/sys/fs/fuse/max_pages_limit: Invalid argument
$ echo 65535 | sudo tee /proc/sys/fs/fuse/max_pages_limit
65535
$ sysctl -n fs.fuse.max_pages_limit
65535
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
We should convert fs/fuse code to use a newly introduced
invalid_mnt_idmap instead of passing a NULL as idmap pointer.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Let's convert all existing callers properly.
No functional changes intended.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Need to translate uid and gid in case of chown(2).
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If idmap == NULL *and* filesystem daemon declared idmapped mounts
support, then uid/gid values in a fuse header will be -1.
No functional changes intended.
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_mount_list doesn't exist, use fuse_conn_list.
Signed-off-by: Aurelien Aptel <aaptel@nvidia.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Virtiofs has its own queuing mechanism, but still requests are first queued
on fiq->pending to be immediately dequeued and queued onto the virtio
queue.
The queuing on fiq->pending is unnecessary and might even have some
performance impact due to being a contention point.
Forget requests are handled similarly.
Move the queuing of requests and forgets into the fiq->ops->*.
fuse_iqueue_ops are renamed to reflect the new semantics.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Fixed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Tested-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Reviewed-by: Peter-Jan Gootzen <pgootzen@nvidia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There is a confusion with fuse_file_uncached_io_{start,end} interface.
These helpers do two things when called from passthrough open()/release():
1. Take/drop negative refcount of fi->iocachectr (inode uncached io mode)
2. State change ff->iomode IOM_NONE <-> IOM_UNCACHED (file uncached open)
The calls from parallel dio write path need to take a reference on
fi->iocachectr, but they should not be changing ff->iomode state, because
in this case, the fi->iocachectr reference does not stick around until file
release().
Factor out helpers fuse_inode_uncached_io_{start,end}, to be used from
parallel dio write path and rename fuse_file_*cached_io_{start,end} helpers
to fuse_file_*cached_io_{open,release} to clarify the difference.
Fixes: 205c1d802683 ("fuse: allow parallel dio writes with FUSE_DIRECT_IO_ALLOW_MMAP")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Add passthrough mode for regular file I/O.
This allows performing read and write (also via memory maps) on a
backing file without incurring the overhead of roundtrips to
userspace. For now this is only allowed to privileged servers, but
this limitation will go away in the future (Amir Goldstein)
- Fix interaction of direct I/O mode with memory maps (Bernd Schubert)
- Export filesystem tags through sysfs for virtiofs (Stefan Hajnoczi)
- Allow resending queued requests for server crash recovery (Zhao Chen)
- Misc fixes and cleanups
* tag 'fuse-update-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (38 commits)
fuse: get rid of ff->readdir.lock
fuse: remove unneeded lock which protecting update of congestion_threshold
fuse: Fix missing FOLL_PIN for direct-io
fuse: remove an unnecessary if statement
fuse: Track process write operations in both direct and writethrough modes
fuse: Use the high bit of request ID for indicating resend requests
fuse: Introduce a new notification type for resend pending requests
fuse: add support for explicit export disabling
fuse: __kuid_val/__kgid_val helpers in fuse_fill_attr_from_inode()
fuse: fix typo for fuse_permission comment
fuse: Convert fuse_writepage_locked to take a folio
fuse: Remove fuse_writepage
virtio_fs: remove duplicate check if queue is broken
fuse: use FUSE_ROOT_ID in fuse_get_root_inode()
fuse: don't unhash root
fuse: fix root lookup with nonzero generation
fuse: replace remaining make_bad_inode() with fuse_make_bad()
virtiofs: drop __exit from virtio_fs_sysfs_exit()
fuse: implement passthrough for mmap
fuse: implement splice read/write passthrough
...
|
|
The same protection is provided by file->f_pos_lock.
Note, this relies on the fact that file->f_mode has FMODE_ATOMIC_POS.
This flag is cleared by stream_open(), which would prevent locking of
f_pos_lock.
Prior to commit 7de64d521bf9 ("fuse: break up fuse_open_common()")
FOPEN_STREAM on a directory would cause stream_open() to be called.
After this commit this is not done anymore, so f_pos_lock will always
be locked.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Our user space filesystem relies on fuse to provide POSIX interface.
In our test, a known string is written into a file and the content
is read back later to verify correct data returned. We observed wrong
data returned in read buffer in rare cases although correct data are
stored in our filesystem.
Fuse kernel module calls iov_iter_get_pages2() to get the physical
pages of the user-space read buffer passed in read(). The pages are
not pinned to avoid page migration. When page migration occurs, the
consequence are two-folds.
1) Applications do not receive correct data in read buffer.
2) fuse kernel writes data into a wrong place.
Using iov_iter_extract_pages() to pin pages fixes the issue in our
test.
An auxiliary variable "struct page **pt_pages" is used in the patch
to prepare the 2nd parameter for iov_iter_extract_pages() since
iov_iter_get_pages2() uses a different type for the 2nd parameter.
[SzM] add iov_iter_extract_will_pin(ii) and unpin only if true.
Signed-off-by: Lei Huang <lei.huang@linux.intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|