path: root/drivers/block
Age  Commit message  Author
2025-11-06  ublk: use copy_{to,from}_iter() for user copy  (Caleb Sander Mateos)
ublk_copy_user_pages()/ublk_copy_io_pages() currently use iov_iter_get_pages2() to extract the pages from the iov_iter and memcpy() between the bvec_iter and the iov_iter's pages one page at a time. Switch to using copy_to_iter()/copy_from_iter() instead. This avoids the user page reference count increments and decrements and the need to split the memcpy() at user page boundaries. It also simplifies the code considerably. Ming reports a 40% throughput improvement when issuing I/O to the selftests null ublk server with zero-copy disabled. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
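For illustration, a minimal sketch of the iterator-based copy described above; the helper name and surrounding context are hypothetical, while copy_to_iter() and kmap_local_page() are the real kernel APIs:

    #include <linux/uio.h>
    #include <linux/bvec.h>
    #include <linux/highmem.h>

    /* Copy one bvec segment to the user's iov_iter. copy_to_iter()
     * walks user page boundaries internally, so no page pinning via
     * iov_iter_get_pages2() and no manual memcpy() splitting is needed. */
    static size_t copy_bvec_to_user(const struct bio_vec *bv, struct iov_iter *iter)
    {
            void *src = kmap_local_page(bv->bv_page);
            size_t copied = copy_to_iter(src + bv->bv_offset, bv->bv_len, iter);

            kunmap_local(src);
            return copied;
    }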
2025-11-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.18-rc5). Conflicts: drivers/net/wireless/ath/ath12k/mac.c 9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"") 6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon") https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net Adjacent changes: drivers/net/ethernet/intel/Kconfig b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG") 93f53db9f9dc ("ice: switch to Page Pool") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-05  rust: block: update ARef and AlwaysRefCounted imports from sync::aref  (Shankari Anand)
Update call sites in the block subsystem to import `ARef` and `AlwaysRefCounted` from `sync::aref` instead of `types`. This aligns with the ongoing effort to move `ARef` and `AlwaysRefCounted` to sync. Suggested-by: Benno Lossin <lossin@kernel.org> Link: https://github.com/Rust-for-Linux/linux/issues/1173 Signed-off-by: Shankari Anand <shankari.ak0208@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05  block: introduce disk_report_zone()  (Damien Le Moal)
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a zone report executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug's zone write pointer offset after a write error. However, this solution makes the flow of zone information harder to follow. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data, and the helper function disk_report_zone(), which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
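A hedged sketch of the resulting driver-side pattern; the disk_report_zone() signature below is inferred from the description above rather than copied from the patch, and mydrv_fill_zone() is a hypothetical driver-specific helper:

    #include <linux/blkdev.h>

    static void mydrv_fill_zone(struct gendisk *disk, sector_t sector,
                                struct blk_zone *zone); /* driver specific */

    /* Hypothetical report_zones implementation: every reported zone is
     * funneled through disk_report_zone(), which refreshes the zone write
     * plug write pointer offset and then invokes the user callback stored
     * in struct blk_report_zones_args. */
    static int mydrv_report_zones(struct gendisk *disk, sector_t sector,
                                  unsigned int nr_zones,
                                  struct blk_report_zones_args *args)
    {
            struct blk_zone zone;
            unsigned int i;
            int ret;

            for (i = 0; i < nr_zones; i++) {
                    mydrv_fill_zone(disk, sector, &zone);
                    ret = disk_report_zone(disk, &zone, i, args);
                    if (ret)
                            return ret;
                    sector += zone.len;
            }
            return i;
    }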
2025-11-04  net: Convert proto_ops connect() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  net: Convert proto_ops bind() callbacks to use sockaddr_unsized  (Kees Cook)
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04  nbd: don't copy kernel creds  (Christian Brauner)
No need to copy kernel credentials. Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-6-cb3ec8711a6a@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03  ublk: use struct_size() for allocation  (Ming Lei)
Convert ublk_queue to use struct_size() for allocation. Changes in this commit: 1. Update ublk_init_queue() to use struct_size(ubq, ios, depth) instead of manual size calculation (sizeof(struct ublk_queue) + depth * sizeof(struct ublk_io)). This provides better type safety and makes the code more maintainable by using the standard kernel macro for flexible array handling. Meanwhile, annotate ublk_queue.ios with __counted_by(). Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
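A minimal sketch of the pattern, with the struct reduced to the fields the commit message mentions (the *_sketch names are stand-ins, not the driver's types):

    #include <linux/types.h>
    #include <linux/overflow.h>
    #include <linux/slab.h>

    struct ublk_io_sketch { u64 addr; };   /* stand-in for struct ublk_io */

    struct ublk_queue_sketch {
            u16 q_depth;
            struct ublk_io_sketch ios[] __counted_by(q_depth);
    };

    static struct ublk_queue_sketch *queue_alloc(u16 depth)
    {
            /* struct_size() == sizeof(*ubq) + depth * sizeof(ubq->ios[0]),
             * with integer-overflow checking built in */
            struct ublk_queue_sketch *ubq =
                    kvzalloc(struct_size(ubq, ios, depth), GFP_KERNEL);

            if (ubq)
                    ubq->q_depth = depth;
            return ubq;
    }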
2025-11-03  ublk: implement NUMA-aware memory allocation  (Ming Lei)
Implement NUMA-friendly memory allocation for ublk driver to improve performance on multi-socket systems. This commit includes the following changes: 1. Rename __queues to queues, dropping the __ prefix since the field is now accessed directly throughout the codebase rather than only through the ublk_get_queue() helper. 2. Remove the queue_size field from struct ublk_device as it is no longer needed. 3. Move queue allocation and deallocation into ublk_init_queue() and ublk_deinit_queue() respectively, improving encapsulation. This simplifies ublk_init_queues() and ublk_deinit_queues() to just iterate and call the per-queue functions. 4. Add ublk_get_queue_numa_node() helper function to determine the appropriate NUMA node for a queue by finding the first CPU mapped to that queue via tag_set.map[HCTX_TYPE_DEFAULT].mq_map[] and converting it to a NUMA node using cpu_to_node(). This function is called internally by ublk_init_queue() to determine the allocation node. 5. Allocate each queue structure on its local NUMA node using kvzalloc_node() in ublk_init_queue(). 6. Allocate the I/O command buffer on the same NUMA node using alloc_pages_node(). This reduces memory access latency on multi-socket NUMA systems by ensuring each queue's data structures are local to the CPUs that access them. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
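A sketch of the node lookup in point 4, assuming a standard blk-mq default map; the function name is illustrative:

    #include <linux/blk-mq.h>
    #include <linux/cpumask.h>
    #include <linux/topology.h>

    static int queue_numa_node(struct blk_mq_tag_set *set, unsigned int q_id)
    {
            unsigned int cpu;

            /* the first CPU mapped to this hardware queue decides the node */
            for_each_possible_cpu(cpu) {
                    if (set->map[HCTX_TYPE_DEFAULT].mq_map[cpu] == q_id)
                            return cpu_to_node(cpu);
            }
            return NUMA_NO_NODE;    /* no node preference */
    }

The returned node can then be fed to kvzalloc_node() and alloc_pages_node() as described in points 5 and 6.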
2025-11-03  ublk: reorder tag_set initialization before queue allocation  (Ming Lei)
Move ublk_add_tag_set() before ublk_init_queues() in the device initialization path. This allows us to use the blk-mq CPU-to-queue mapping established by the tag_set to determine the appropriate NUMA node for each queue allocation. The error handling paths are also reordered accordingly. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03  io_uring/uring_cmd: avoid double indirect call in task work dispatch  (Caleb Sander Mateos)
io_uring task work dispatch makes an indirect call to struct io_kiocb's io_task_work.func field to allow running arbitrary task work functions. In the uring_cmd case, this calls io_uring_cmd_work(), which immediately makes another indirect call to struct io_uring_cmd's task_work_cb field. Change the uring_cmd task work callbacks to functions whose signatures match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert from the task work's struct io_tw_req argument to struct io_uring_cmd *. Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid manufacturing issue_flags in the uring_cmd task work callbacks. Now uring_cmd task work dispatch makes a single indirect call to the uring_cmd implementation's callback. This also allows removing the task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for future storage. Since fuse_uring_send_in_task() now has access to the io_tw_token_t, check its cancel field directly instead of relying on the IO_URING_F_TASK_DEAD issue flag. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03  drbd: replace kmap() with kmap_local_page() in receiver path  (Shi Hao)
Use kmap_local_page() instead of kmap() to avoid CPU contention. kmap() uses a global set of mapping slots that can cause contention between multiple CPUs, while kmap_local_page() uses per-CPU slots eliminating this contention. It also ensures non-sleeping operation and provides better cache locality. Convert kmap() to kmap_local_page() as it aligns with ongoing kernel efforts to modernize kmap() usage for better multi-core scalability. Signed-off-by: Shi Hao <i.shihao.999@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
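The shape of the conversion, sketched with the drbd receiver specifics omitted (the helper is hypothetical; kmap_local_page()/kunmap_local() are the real APIs):

    #include <linux/highmem.h>
    #include <linux/string.h>

    static void recv_copy_from_page(void *dst, struct page *page,
                                    unsigned int off, unsigned int len)
    {
            /* per-CPU local mapping: no global kmap slot contention and
             * no sleeping; must be unmapped in the same context */
            void *src = kmap_local_page(page);

            memcpy(dst, src + off, len);
            kunmap_local(src);      /* pairs with kmap_local_page() */
    }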
2025-10-31  Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe: - Fix blk-crypto reporting EIO when EINVAL is the correct error code - Two bug fixes for the block zone support - NVME pull request via Keith: - Target side authentication fixup - Peer-to-peer metadata fixup - null_blk DMA alignment fix * tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: null_blk: set dma alignment to logical block size blk-crypto: use BLK_STS_INVAL for alignment errors block: make REQ_OP_ZONE_OPEN a write operation block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL nvme-pci: use blk_map_iter for p2p metadata nvmet-auth: update sc_c in host response
2025-10-31  null_blk: set dma alignment to logical block size  (Hans Holmberg)
This driver assumes that bio vectors are memory aligned to the logical block size, so set the queue limit to reflect that. Unless we set up the limit based on the logical block size, we will go out of page bounds in copy_to_nullb / copy_from_nullb. Apparently this wasn't noticed so far because none of the tests generate such buffers, but since commit 851c4c96db00 ("xfs: implement XFS_IOC_DIOINFO in terms of vfs_getattr") xfstests generates unaligned I/O, which now leads to memory corruption when using null_blk devices with 4k block size. Fixes: bf8d08532bc1 ("iomap: add support for dma aligned direct-io") Fixes: b1a000d3b8ec ("block: relax direct io memory alignment") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
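The fix amounts to one extra assignment in the queue-limits setup; a sketch, with the surrounding function hypothetical and the field names the real queue_limits members:

    #include <linux/blkdev.h>

    static void nullb_setup_limits(struct queue_limits *lim, unsigned int bs)
    {
            lim->logical_block_size = bs;
            /* dma_alignment is a mask: require bio vectors aligned to the
             * logical block size instead of the default 512 bytes, so
             * copy_to_nullb/copy_from_nullb never run past a page */
            lim->dma_alignment = bs - 1;
    }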
2025-10-24  Merge tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe: - Fix dma alignment for PI - Fix selinux bogosity with nbd, where sendmsg would get rejected * tag 'block-6.18-20251023' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: require LBA dma_alignment when using PI nbd: override creds to kernel when calling sock_{send,recv}msg()
2025-10-20  nbd: override creds to kernel when calling sock_{send,recv}msg()  (Ondrej Mosnacek)
sock_{send,recv}msg() internally calls security_socket_{send,recv}msg(), which does security checks (e.g. SELinux) for socket access against the current task. However, _sock_xmit() in drivers/block/nbd.c may be called indirectly from a userspace syscall, where the NBD socket access would be incorrectly checked against the calling userspace task (which simply tries to read/write a file that happens to reside on an NBD device). To fix this, temporarily override creds to kernel ones before calling the sock_*() functions. This allows the security modules to recognize this as internal access by the kernel, which will normally be allowed. A way to trigger the issue is to do the following (on a system with SELinux set to enforcing): ### Create nbd device: truncate -s 256M /tmp/testfile nbd-server localhost:10809 /tmp/testfile ### Connect to the nbd server: nbd-client localhost ### Create mdraid array mdadm --create -l 1 -n 2 /dev/md/testarray /dev/nbd0 missing After these steps, assuming the SELinux policy doesn't allow the unexpected access pattern, errors will be visible on the kernel console: [ 142.204243] nbd0: detected capacity change from 0 to 524288 [ 165.189967] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+ [ 165.252299] md/raid1:md127: active with 1 out of 2 mirrors [ 165.252725] md127: detected capacity change from 0 to 522240 [ 165.255434] block nbd0: Send control failed (result -13) [ 165.255718] block nbd0: Request send failed, requeueing [ 165.256006] block nbd0: Dead connection, failed to find a fallback [ 165.256041] block nbd0: Receive control failed (result -32) [ 165.256423] block nbd0: shutting down sockets [ 165.257196] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.257736] Buffer I/O error on dev md127, logical block 0, async page read [ 165.258263] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.259376] Buffer I/O error on dev md127, logical block 0, async page read [ 165.259920] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.260628] Buffer I/O error on dev md127, logical block 0, async page read [ 165.261661] ldm_validate_partition_table(): Disk read failed. 
[ 165.262108] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.262769] Buffer I/O error on dev md127, logical block 0, async page read [ 165.263697] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.264412] Buffer I/O error on dev md127, logical block 0, async page read [ 165.265412] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.265872] Buffer I/O error on dev md127, logical block 0, async page read [ 165.266378] I/O error, dev nbd0, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.267168] Buffer I/O error on dev md127, logical block 0, async page read [ 165.267564] md127: unable to read partition table [ 165.269581] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.269960] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.270316] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.270913] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.271253] I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2 [ 165.271809] Buffer I/O error on dev nbd0, logical block 0, async page read [ 165.272074] ldm_validate_partition_table(): Disk read failed. [ 165.272360] nbd0: unable to read partition table [ 165.289004] ldm_validate_partition_table(): Disk read failed. [ 165.289614] nbd0: unable to read partition table The corresponding SELinux denial on Fedora/RHEL will look like this (assuming it's not silenced): type=AVC msg=audit(1758104872.510:116): avc: denied { write } for pid=1908 comm="mdadm" laddr=::1 lport=32772 faddr=::1 fport=10809 scontext=system_u:system_r:mdadm_t:s0-s0:c0.c1023 tcontext=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 tclass=tcp_socket permissive=0 The respective backtrace looks like this: @security[mdadm, -13, handshake_exit+221615650 handshake_exit+221615650 handshake_exit+221616465 security_socket_sendmsg+5 sock_sendmsg+106 handshake_exit+221616150 sock_sendmsg+5 __sock_xmit+162 nbd_send_cmd+597 nbd_handle_cmd+377 nbd_queue_rq+63 blk_mq_dispatch_rq_list+653 __blk_mq_do_dispatch_sched+184 __blk_mq_sched_dispatch_requests+333 blk_mq_sched_dispatch_requests+38 blk_mq_run_hw_queue+239 blk_mq_dispatch_plug_list+382 blk_mq_flush_plug_list.part.0+55 __blk_flush_plug+241 __submit_bio+353 submit_bio_noacct_nocheck+364 submit_bio_wait+84 __blkdev_direct_IO_simple+232 blkdev_read_iter+162 vfs_read+591 ksys_read+95 do_syscall_64+92 entry_SYSCALL_64_after_hwframe+120 ]: 1 The issue has started to appear since commit 060406c61c7c ("block: add plug while submitting IO"). Cc: Ming Lei <ming.lei@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2348878 Fixes: 060406c61c7c ("block: add plug while submitting IO") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Stephen Smalley <stephen.smalley.work@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
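A sketch of the override pattern the fix applies around the socket calls; the function shape is illustrative, while override_creds()/revert_creds(), sock_sendmsg()/sock_recvmsg() and init_cred are the real kernel APIs (using &init_cred directly matches the later "nbd: don't copy kernel creds" change above):

    #include <linux/cred.h>
    #include <linux/net.h>

    static int xmit_as_kernel(struct socket *sock, struct msghdr *msg, bool send)
    {
            /* run the socket call with kernel credentials so LSMs see
             * internal kernel access, not the unrelated userspace task
             * that happened to trigger the block I/O */
            const struct cred *old = override_creds(&init_cred);
            int ret;

            ret = send ? sock_sendmsg(sock, msg) : sock_recvmsg(sock, msg, 0);
            revert_creds(old);      /* restore the caller's credentials */
            return ret;
    }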
2025-10-20  rnull: use `kernel::fmt`  (Tamir Duberstein)
Reduce coupling to implementation details of the formatting machinery by avoiding direct use of `core`'s formatting traits and macros. This backslid in commit d969d504bc13 ("rnull: enable configuration via `configfs`") and commit 34585dc649fb ("rnull: add soft-irq completion support"). Acked-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Tamir Duberstein <tamird@gmail.com> Link: https://patch.msgid.link/20251018-cstr-core-v18-5-9378a54385f8@gmail.com Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-10-10  Merge tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe: - Don't include __GFP_NOWARN for loop worker allocation, as it already uses GFP_NOWAIT, which has __GFP_NOWARN set - Small series cleaning up the recent bio_iov_iter_get_pages() changes - loop fix for leaking a reference to the backing file if validation fails - Update of a comment pertaining to disk/partition stat locking * tag 'block-6.18-20251009' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: loop: remove redundant __GFP_NOWARN flag block: move bio_iov_iter_get_bdev_pages to block/fops.c iomap: open code bio_iov_iter_get_bdev_pages block: rename bio_iov_iter_get_pages_aligned to bio_iov_iter_get_pages block: remove bio_iov_iter_get_pages block: Update a comment of disk statistics loop: fix backing file reference leak on validation error
2025-10-08  loop: remove redundant __GFP_NOWARN flag  (Pedro Demarchi Gomes)
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
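The redundancy in a one-liner sketch (the surrounding function is hypothetical; GFP_NOWAIT's definition already carries __GFP_NOWARN):

    #include <linux/slab.h>

    static void *worker_alloc(size_t size)
    {
            /* was: kzalloc(size, GFP_NOWAIT | __GFP_NOWARN);
             * GFP_NOWAIT already includes __GFP_NOWARN, so: */
            return kzalloc(size, GFP_NOWAIT);
    }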
2025-10-02  Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when performing slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extends DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionally making the huge_zero_page persistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap, to end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages".
Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zpool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN write-only mode for ARM, using that architecture's memory tagging feature. It is felt that a write-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improve their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements.
This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in the relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implements reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khugepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios.
/proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-02  loop: fix backing file reference leak on validation error  (Li Chen)
loop_change_fd() and loop_configure() call loop_check_backing_file() to validate the new backing file. If validation fails, the reference acquired by fget() was not dropped, leaking a file reference. Fix this by calling fput(file) before returning the error. Cc: stable@vger.kernel.org Cc: Markus Elfring <Markus.Elfring@web.de> CC: Yang Erkun <yangerkun@huawei.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Yu Kuai <yukuai1@huaweicloud.com> Fixes: f5c84eff634b ("loop: Add sanity check for read/write_iter") Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
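A sketch of the corrected error path; the function shape is illustrative, while fget()/fput() are the real APIs and loop_check_backing_file() is the validation helper named in the commit:

    #include <linux/file.h>

    static int set_backing_file(int fd)
    {
            struct file *file = fget(fd);
            int error;

            if (!file)
                    return -EBADF;

            error = loop_check_backing_file(file);
            if (error) {
                    fput(file);     /* drop the fget() reference on failure */
                    return error;
            }
            /* ... proceed to install the backing file ... */
            return 0;
    }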
2025-10-02  Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature of the new bitmap is that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update the bitmap bit, with no additional overhead for the following writes. Because only written data needs to be resynced or recovered, there is no need for a full disk resync/recovery when creating a new array or replacing a disk. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferences needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot via nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA interaction for integrity requests - Improve how iovecs are treated with regard to O_DIRECT alignment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to. - Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...
2025-10-02  Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good idea as that struct has a vastly different lifetime. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was set up with 32b CQEs using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage and in memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user specify the rx buffer length on setup. - Syscall / synchronous buffer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for multi-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-09-30  Merge tag 'rust-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux  (Linus Torvalds)
Pull rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Derive 'Zeroable' for all structs and unions generated by 'bindgen' where possible and corresponding cleanups. To do so, add the 'pin-init' crate as a dependency to 'bindings' and 'uapi'. It also includes its first use in the 'cpufreq' module, with more to come in the next cycle. - Add warning to the 'rustdoc' target to detect broken 'srctree/' links and fix existing cases. - Remove support for unused (since v6.16) host '#[test]'s, simplifying the 'rusttest' target. Tests should generally run within KUnit. 'kernel' crate: - Add 'ptr' module with a new 'Alignment' type, which is always a power of two and is used to validate that a given value is a valid alignment and to perform masking and alignment operations: // Checked at build time. assert_eq!(Alignment::new::<16>().as_usize(), 16); // Checked at runtime. assert_eq!(Alignment::new_checked(15), None); assert_eq!(Alignment::of::<u8>().log2(), 0); assert_eq!(0x25u8.align_down(Alignment::new::<0x10>()), 0x20); assert_eq!(0x5u8.align_up(Alignment::new::<0x10>()), Some(0x10)); assert_eq!(u8::MAX.align_up(Alignment::new::<0x10>()), None); It also includes its first use in Nova. - Add 'core::mem::{align,size}_of{,_val}' to the prelude, matching Rust 1.80.0. - Keep going with the steps on our migration to the standard library 'core::ffi::CStr' type (use 'kernel::{fmt, prelude::fmt!}' and use upstream method names). - 'error' module: improve 'Error::from_errno' and 'to_result' documentation, including examples/tests. - 'sync' module: extend 'aref' submodule documentation now that it exists, and more updates to complete the ongoing move of 'ARef' and 'AlwaysRefCounted' to 'sync::aref'. - 'list' module: add an example/test for 'ListLinksSelfPtr' usage. - 'alloc' module: - Implement 'Box::pin_slice()', which constructs a pinned slice of elements. - Provide information about the minimum alignment guarantees of 'Kmalloc', 'Vmalloc' and 'KVmalloc'. - Take minimum alignment guarantees of allocators for 'ForeignOwnable' into account. - Remove the 'allocator_test' (including 'Cmalloc'). - Add doctest for 'Vec::as_slice()'. - Constify various methods. - 'time' module: - Add methods on 'HrTimer' that can only be called with exclusive access to an unarmed timer, or from timer callback context. - Add arithmetic operations to 'Instant' and 'Delta'. - Add a few convenience and access methods to 'HrTimer' and 'Instant'. 'macros' crate: - Reduce collections in 'quote!' macro.
And a few other cleanups and improvements" * tag 'rust-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (58 commits) gpu: nova-core: use Alignment for alignment-related operations rust: add `Alignment` type rust: macros: reduce collections in `quote!` macro rust: acpi: use `core::ffi::CStr` method names rust: of: use `core::ffi::CStr` method names rust: net: use `core::ffi::CStr` method names rust: miscdevice: use `core::ffi::CStr` method names rust: kunit: use `core::ffi::CStr` method names rust: firmware: use `core::ffi::CStr` method names rust: drm: use `core::ffi::CStr` method names rust: cpufreq: use `core::ffi::CStr` method names rust: configfs: use `core::ffi::CStr` method names rust: auxiliary: use `core::ffi::CStr` method names drm/panic: use `core::ffi::CStr` method names rust: device: use `kernel::{fmt,prelude::fmt!}` rust: sync: use `kernel::{fmt,prelude::fmt!}` rust: seq_file: use `kernel::{fmt,prelude::fmt!}` rust: kunit: use `kernel::{fmt,prelude::fmt!}` rust: file: use `kernel::{fmt,prelude::fmt!}` rust: device: use `kernel::{fmt,prelude::fmt!}` ...
2025-09-24  ublk: remove redundant zone op check in ublk_setup_iod()  (Caleb Sander Mateos)
ublk_setup_iod() checks first whether the request is a zoned operation issued to a device without zoned support and returns BLK_STS_IOERR if so. However, such a request would already hit the default case in the subsequent switch statement and fail the ublk_queue_is_zoned() check, which also results in a return of BLK_STS_IOERR. So remove the redundant early check for unsupported zone ops. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()  (Caleb Sander Mateos)
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split out a separate io_uring_cmd_done32() helper for ->uring_cmd() implementations that return 32-byte CQEs. The res2 value passed to io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores it when is_cqe32 is passed as false. So drop the parameter from io_uring_cmd_done() to simplify the callers and clarify that it's not possible to return an extra value beyond the 32-bit CQE result. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-21  aoe: stop calling page_address() in free_page()  (Vishal Moola (Oracle))
free_page() should be used when we only have a virtual address. We should call __free_page() directly on our page instead. Link: https://lkml.kernel.org/r/20250903185921.1785167-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
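The distinction, sketched (the wrapper is hypothetical; free_page(), __free_page() and page_address() are the real APIs):

    #include <linux/gfp.h>

    static void drop_page(struct page *page)
    {
            /* was: free_page((unsigned long)page_address(page));
             * free_page() is for virtual addresses; with a struct page in
             * hand, free it directly: */
            __free_page(page);
    }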
2025-09-21  Merge branch 'mm-hotfixes-stable' into mm-stable  (Andrew Morton)
Pick up changes required by mm-stable material: hugetlb and damon.
2025-09-20  ublk: don't access ublk_queue in ublk_unmap_io()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_unmap_io() is a frequent cache miss. Pass to __ublk_complete_rq() whether the ublk server's data buffer needs to be copied to the request. In the callers __ublk_fail_req() and ublk_ch_uring_cmd_local(), get the flags from the ublk_device instead, as its flags have just been read. In ublk_put_req_ref(), pass false since all the features that require reference counting disable copying of the data buffer upon completion. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass ublk_io to __ublk_complete_rq()  (Caleb Sander Mateos)
All callers of __ublk_complete_rq() already know the ublk_io. Pass it in to avoid looking it up again. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_need_complete_req()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_need_complete_req() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_check_commit_and_fetch() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't pass ublk_queue to ublk_fetch()  (Caleb Sander Mateos)
ublk_fetch() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_config_io_buf()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_config_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_check_fetch_buf()  (Caleb Sander Mateos)
Obtain the ublk device flags from ublk_device to avoid needing to access the ublk_queue, which may be a cache miss. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass q_id and tag to __ublk_check_and_get_req()  (Caleb Sander Mateos)
__ublk_check_and_get_req() only uses its ublk_queue argument to get the q_id and tag. Pass those arguments explicitly to save an access to the ublk_queue. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_daemon_register_io_buf()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_daemon_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't access ublk_queue in ublk_register_io_buf()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: pass ublk_device to ublk_register_io_buf()  (Caleb Sander Mateos)
Avoid repeating the 2 dereferences to get the ublk_device from the io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to ublk_register_io_buf(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't dereference ublk_queue in ublk_check_and_get_req()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()  (Caleb Sander Mateos)
For ublk servers with many ublk queues, accessing the ublk_queue to handle a ublk command is a frequent cache miss. Get the queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: add helpers to check ublk_device flags  (Caleb Sander Mateos)
Introduce ublk_device analogues of the ublk_queue flag helpers: - ublk_support_zero_copy() -> ublk_dev_support_zero_copy() - ublk_support_auto_buf_reg() -> ublk_dev_support_auto_buf_reg() - ublk_support_user_copy() -> ublk_dev_support_user_copy() - ublk_need_map_io() -> ublk_dev_need_map_io() - ublk_need_req_ref() -> ublk_dev_need_req_ref() - ublk_need_get_data() -> ublk_dev_need_get_data() These will be used in subsequent changes to avoid accessing the ublk_queue just for the flags, and instead use the ublk_device. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
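One of the device-level helpers might look like the sketch below, inferred from the naming scheme above; the exact flag set each helper tests is in the patch, with UBLK_F_USER_COPY shown only as an example:

    static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub)
    {
            /* ub->dev_info.flags mirrors the per-queue flags, so the
             * per-queue dereference (and its cache miss) is avoided */
            return ub->dev_info.flags & UBLK_F_USER_COPY;
    }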
2025-09-20  ublk: don't pass ublk_queue to __ublk_fail_req()  (Caleb Sander Mateos)
__ublk_fail_req() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: don't pass q_id to ublk_queue_cmd_buf_size()  (Caleb Sander Mateos)
ublk_queue_cmd_buf_size() only needs the queue depth, which is the same for all queues. Get the queue depth from the ublk_device instead so the q_id parameter can be dropped. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20  ublk: remove ubq check in ublk_check_and_get_req()  (Caleb Sander Mateos)
ublk_get_queue() never returns a NULL pointer, so there's no need to check its return value in ublk_check_and_get_req(). Drop the check. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19  Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe: "A set of fixes for an issue with md array assembly and drbd for devices supporting write zeroes" * tag 'block-6.17-20250918' of git://git.kernel.dk/linux: drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter
2025-09-17  drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter  (Zhang Yi)
The parameter max_hw_wzeroes_unmap_sectors in queue_limits should be equal to max_write_zeroes_sectors if it is set to a non-zero value. However, when the backend bdev is specified, this parameter is initialized to UINT_MAX during the call to blk_set_stacking_limits(), while only max_write_zeroes_sectors is adjusted. Therefore, this discrepancy triggers a value check failure in blk_validate_limits(). Since the drbd driver doesn't yet support unmap write zeroes, fix this failure by explicitly setting max_hw_wzeroes_unmap_sectors to zero. Fixes: 0c40d7cb5ef3 ("block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
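The fix reduces to explicitly zeroing the limit after stacking; a sketch, where the function is hypothetical but blk_set_stacking_limits() and the queue_limits field are the real names:

    #include <linux/blkdev.h>

    static void drbd_fix_limits(struct queue_limits *lim)
    {
            blk_set_stacking_limits(lim);   /* leaves the field at UINT_MAX */
            /* drbd does not support unmap write zeroes, so the limit must
             * be 0 to satisfy blk_validate_limits() */
            lim->max_hw_wzeroes_unmap_sectors = 0;
    }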
2025-09-16  rust: block: use `kernel::{fmt,prelude::fmt!}`  (Tamir Duberstein)
Reduce coupling to implementation details of the formatting machinery by avoiding direct use of `core`'s formatting traits and macros. Suggested-by: Alice Ryhl <aliceryhl@google.com> Link: https://rust-for-linux.zulipchat.com/#narrow/channel/288089-General/topic/Custom.20formatting/with/516476467 Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <lossin@kernel.org> Signed-off-by: Tamir Duberstein <tamird@gmail.com> Acked-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-09-15  zram: fix slot write race condition  (Sergey Senozhatsky)
Parallel concurrent writes to the same zram index result in leaked zsmalloc handles. Schematically we can have something like this:

    CPU0                        CPU1
    zram_slot_lock()
    zs_free(handle)
    zram_slot_unlock()
                                zram_slot_lock()
                                zs_free(handle)
                                zram_slot_unlock()
    compress                    compress
    handle = zs_malloc()        handle = zs_malloc()
    zram_slot_lock()
    zram_set_handle(handle)
    zram_slot_unlock()
                                zram_slot_lock()
                                zram_set_handle(handle)
                                zram_slot_unlock()

Either CPU0's or CPU1's zsmalloc handle will leak because zs_free() is done too early. In fact, we need to reset the zram entry right before we set its new handle, all under the same slot lock scope. Link: https://lkml.kernel.org/r/20250909045150.635345-1-senozhatsky@chromium.org Fixes: 71268035f5d7 ("zram: free slot memory early during write") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+UtpGoW5WEdEU7uVTtsSCjPN=ksN6EcvyypAtFDOUf30A@mail.gmail.com/ Tested-by: Changhui Zhong <czhong@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
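A sketch of the corrected write-side ordering; the function shape is illustrative, while the lock/free/set names follow zram's code as quoted in the diagram above:

    static void write_slot(struct zram *zram, u32 index, unsigned long handle)
    {
            zram_slot_lock(zram, index);
            zram_free_page(zram, index);            /* zs_free() the old handle here, */
            zram_set_handle(zram, index, handle);   /* right before installing the new one */
            zram_slot_unlock(zram, index);
    }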
2025-09-13  zram: protect recomp_algorithm_show() with ->init_lock  (Sergey Senozhatsky)
sysfs handlers should be called under ->init_lock and are not supposed to unlock it until return, otherwise e.g. a concurrent reset() can occur. There is one handler that breaks that rule: recomp_algorithm_show(). Move ->init_lock handling outside of __comp_algorithm_show() (also drop it and call zcomp_available_show() directly) so that the entire recomp_algorithm_show() loop is protected by the lock, as opposed to protecting individual iterations. The patch does not need to go to -stable, as it does not fix any runtime errors (at least I can't think of any). It makes recomp_algorithm_show() "atomic" w.r.t. zram reset() (just like the rest of zram sysfs show() handlers), that's a pretty minor change. Link: https://lkml.kernel.org/r/20250805101946.1774112-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reported-by: Seyediman Seyedarab <imandevel@gmail.com> Suggested-by: Seyediman Seyedarab <imandevel@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>