summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
7 daysMerge tag 'for-6.19/block-20251201' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...
8 daysMerge tag 'core-core-2025-12-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core irq cleanup from Thomas Gleixner: "Tree wide cleanup of the remaining users of in_irq() which got replaced by in_hardirq() and marked deprecated in 2020" * tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Remove in_irq()
11 daysmd: remove legacy 1s delay in md_notify_rebootTarun Sahu
During system shutdown, the md driver registered notifier function (md_notify_reboot) currently imposes a hardcoded one-second delay. This delay was introduced approximately 23 years ago and was likely necessary for the hardware generation of that time. Proposing this patch to make sure there are no known devices that need this delay. Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com Signed-off-by: Tarun Sahu <tarunsahu@google.com> Reviewed-by: Yu Kuai<yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
11 daysmd/raid5: fix IO hang when array is broken with IO inflightYu Kuai
Following test can cause IO hang: mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none sleep 5 echo 1 > /sys/block/sda/device/delete echo 1 > /sys/block/sdb/device/delete echo 1 > /sys/block/sdc/device/delete echo 1 > /sys/block/sdd/device/delete dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct Root cause: 1) all disks removed, however all rdevs in the array is still in sync, IO will be issued normally. 2) IO failure from sda, and set badblocks failed, sda will be faulty and MD_SB_CHANGING_PENDING will be set. 3) error recovery try to recover this IO from other disks, IO will be issued to sdb, sdc, and sdd. 4) IO failure from sdb, and set badblocks failed again, now array is broken and will become read-only. 5) IO failure from sdc and sdd, however, stripe can't be handled anymore because MD_SB_CHANGING_PENDING is set: handle_stripe handle_stripe if (test_bit MD_SB_CHANGING_PENDING) set_bit STRIPE_HANDLE goto finish // skip handling failed stripe release_stripe if (test_bit STRIPE_HANDLE) list_add_tail conf->hand_list 6) later raid5d can't handle failed stripe as well: raid5d md_check_recovery md_update_sb if (!md_is_rdwr()) // can't clear pending bit return if (test_bit MD_SB_CHANGING_PENDING) break; // can't handle failed stripe Since MD_SB_CHANGING_PENDING can never be cleared for read-only array, fix this problem by skip this checking for read-only array. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com Fixes: d87f064f5874 ("md: never update metadata when array is read-only.") Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
11 daysmd: warn about updating super block failureYu Kuai
Many personalities will handle IO error from daemon thread(like raid1d, raid10d, raid5d), and sb will require to be clean before hanlding these failed IO. However update sb can fail, for example array is broken by IO failure, or user config sysfs api array_state. This patch adds warning if updating sb failed first, in case this will be related to IO hang. Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com>
11 daysmd/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raidYu Kuai
Commit 2107457e31fa ("md/raid0: Move queue limit setup before r0conf initialization") dereference mddev->gendisk unconditionally, which is NULL for dm-raid. Fix this problem by reverting to old codes for dm-raid. Link: https://lore.kernel.org/linux-raid/20251116021816.107648-1-yukuai@fnnas.com Fixes: 2107457e31fa ("md/raid0: Move queue limit setup before r0conf initialization") Reported-and-tested-by: Changhui Zhong <czhong@redhat.com> Closes: https://lore.kernel.org/all/CAGVVp+VqVnvGeneUoTbYvBv2cw6GwQRrR3B-iQ-_9rVfyumoKA@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
2025-11-21dm-verity: fix unreliable memory allocationMikulas Patocka
GFP_NOWAIT allocation may fail anytime. It needs to be changed to GFP_NOIO. There's no need to handle an error because mempool_alloc with GFP_NOIO can't fail. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Biggers <ebiggers@kernel.org>
2025-11-20dm: fix failure when empty flush's bi_sector points beyond the device endMikulas Patocka
An empty flush bio can have arbitrary bi_sector. The commit 2b1c6d7a890a introduced a regression that device mapper would fail an empty flush bio with -EIO if the sector pointed beyond the end of the device. The commit introduced an optimization, that optimization would pass flushes to __split_and_process_bio and __split_and_process_bio is not prepared to handle empty bios. Fix this bug by passing only non-empty flushes to __split_and_process_bio - non-empty flushes must have valid bi_sector. Empty bios will go through __send_empty_flush, as they did before the optimization. This problem can be reproduced by running the lvm2 test: make check_local T=lvconvert-thin.sh LVM_TEST_PREFER_BRD=0 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 2b1c6d7a890a ("dm: optimize REQ_PREFLUSH with data when using the linear target") Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
2025-11-18dm-pcache: zero cache_info before default initLi Chen
pcache_meta_find_latest() leaves whatever it last copied into the caller’s buffer even when it returns NULL. For cache_info_init(), that meant cache->cache_info could still contain CRC-bad garbage when no valid metadata exists, leading later initialization paths to read bogus flags. Explicitly memset cache->cache_info in cache_info_init_default() so new-cache paths start from a clean slate. The default sequence number assignment becomes redundant with this reset, so it drops out. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Zheng Gu <cengku@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18dm-pcache: reuse meta_addr in pcache_meta_find_latestLi Chen
pcache_meta_find_latest() already computes the metadata address as meta_addr. Reuse that instead of recomputing. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-18dm-pcache: allow built-in build and rename flush helperLi Chen
CONFIG_BCACHE is tristate, so dm-pcache can also be built-in. Switch the Makefile to use obj-$(CONFIG_DM_PCACHE) so the target can be linked into vmlinux instead of always being a loadable module. Also rename cache_flush() to pcache_cache_flush() to avoid a global symbol clash with sunrpc/cache.c's cache_flush(). Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-11-13bcache: Avoid -Wflex-array-member-not-at-end warningGustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Use the new TRAILING_OVERLAP() helper to fix the following warning: drivers/md/bcache/bset.h:330:27: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] This helper creates a union between a flexible-array member (FAM) and a set of MEMBERS that would otherwise follow it. This overlays the trailing MEMBER struct btree_iter_set stack_data[MAX_BSETS]; onto the FAM struct btree_iter::data[], while keeping the FAM and the start of MEMBER aligned. The static_assert() ensures this alignment remains, and it's intentionally placed immediately after the corresponding structures --no blank line in between. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch continues the effort to refactor worqueue APIs, which has begun with the change introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: replace use of system_wq with system_percpu_wqMarco Crivellari
Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. This patch continues the effort to refactor worqueue APIs, which has begun with the change introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") system_wq should be the per-cpu workqueue, yet in this name nothing makes that clear, so replace system_wq with system_percpu_wq. The old wq (system_wq) will be kept for a few release cycles. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: remove redundant __GFP_NOWARNQianfeng Rong
GFP_NOWAIT already includes __GFP_NOWARN, so let's remove the redundant __GFP_NOWARN. Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Acked-by: Coly Li <colyli@fnnas.com> Acked-by: Coly Li <colyli@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: reduce gc latency by processing less nodes and sleep less timeColy Li
When bcache device is busy for high I/O loads, there are two methods to reduce the garbage collection latency, - Process less nodes in eac loop of incremental garbage collection in btree_gc_recurse(). - Sleep less time between two full garbage collection in bch_btree_gc(). This patch introduces to hleper routines to provide different garbage collection nodes number and sleep intervel time. - btree_gc_min_nodes() If there is no front end I/O, return 128 nodes to process in each incremental loop, otherwise only 10 nodes are returned. Then front I/O is able to access the btree earlier. - btree_gc_sleep_ms() If there is no synchronized wait for bucket allocation, sleep 100 ms between two incremental GC loop. Othersize only sleep 10 ms before incremental GC loop. Then a faster GC may provide available buckets earlier, to avoid most of bcache working threads from being starved by buckets allocation. The idea is inspired by works from Mingzhe Zou and Robert Pang, but much simpler and the expected behavior is more predictable. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Robert Pang <robertpang@google.com> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: drop discard sysfs interfaceColy Li
Since discard code is removed, now the sysfs interface to enable discard is useless. This patch removes the corresponding sysfs entry, and remove bool variable 'discard' from struct cache as well. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: remove discard code from alloc.cColy Li
Bcache allocator initially has no free space to allocate. Firstly it does a garbage collection which is triggered by a cache device write and fills free space into ca->free[] lists. The discard happens after the free bucket is handled by garbage collection added into one of the ca->free[] lists. But normally this bucket will be allocated out very soon to requester and filled data onto it. The discard hint on this bucket LBA range doesn't help SSD control to improve internal erasure performance, and waste extra CPU cycles to issue discard bios. This patch removes the almost-useless discard code from alloc.c. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13bcache: get rid of discard code from journalColy Li
In bcache journal there is discard functionality but almost useless in reality. Because discard happens after a journal bucket is reclaimed, and the reclaimed bucket is allocated for new journaling immediately. There is no time for underlying SSD to use the discard hint for internal data management. The discard code in bcache journal doesn't bring any performance optimization and wastes CPU cycles for issuing discard bios. Therefore this patch gits rid of it from journal.c and journal.h. Signed-off-by: Coly Li <colyli@fnnas.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13dm: fix zone reset all operation processingDamien Le Moal
dm_zone_get_reset_bitmap() is used to generate a bitmap of the zones of a zoned device target when a REQ_OP_ZONE_RESET_ALL request is being processed. This bitmap is built by executing a zone report with a report callback set to the function dm_zone_need_reset_cb() in struct dm_report_zones_args. However, the cb callback pointer is not anymore the same as the callback specified by callers of the blkdev_report_zones() function. Rather, this is a DM internal callback and report zones callback functions from blkdev_report_zones() are passed using struct blk_report_zones_args, introduced with commit db9aed869f34 ("block: introduce disk_report_zone()"). This commit changed the DM main report zones callback handler function dm_report_zones_cb() to call the new disk_report_zone() so that callback functions from blkdev_report_zones() are executed, and this change resulted in the DM internal dm_zone_need_reset_cb() callback function to not be executed anymore, turning any REQ_OP_ZONE_RESET_ALL request into a no-op. Fix this by calling in dm_report_zones_cb() the DM internal cb function specified in struct dm_report_zones_args. Fixes: db9aed869f34 ("block: introduce disk_report_zone()"). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-11md: allow configuring logical block sizeLi Nan
Previously, raid array used the maximum logical block size (LBS) of all member disks. Adding a larger LBS disk at runtime could unexpectedly increase RAID's LBS, risking corruption of existing partitions. This can be reproduced by: ``` # LBS of sd[de] is 512 bytes, sdf is 4096 bytes. mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean # LBS is 512 cat /sys/block/md0/queue/logical_block_size # create partition md0p1 parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100% lsblk | grep md0p1 # LBS becomes 4096 after adding sdf mdadm --add -q /dev/md0 /dev/sdf cat /sys/block/md0/queue/logical_block_size # partition lost partprobe /dev/md0 lsblk | grep md0p1 ``` Simply restricting larger-LBS disks is inflexible. In some scenarios, only disks with 512 bytes LBS are available currently, but later, disks with 4KB LBS may be added to the array. Making LBS configurable is the best way to solve this scenario. After this patch, the raid will: - store LBS in disk metadata - add a read-write sysfs 'mdX/logical_block_size' Future mdadm should support setting LBS via metadata field during RAID creation and the new sysfs. Though the kernel allows runtime LBS changes, users should avoid modifying it after creating partitions or filesystems to prevent compatibility issues. Only 1.x metadata supports configurable LBS. 0.90 metadata inits all fields to default values at auto-detect. Supporting 0.90 would require more extensive changes and no such use case has been observed. Note that many RAID paths rely on PAGE_SIZE alignment, including for metadata I/O. A larger LBS than PAGE_SIZE will result in metadata read/write failures. So this config should be prevented. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: add check_new_feature module parameterLi Nan
Raid checks if pad3 is zero when loading superblock from disk. Arrays created with new features may fail to assemble on old kernels as pad3 is used. Add module parameter check_new_feature to bypass this check. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md/raid0: Move queue limit setup before r0conf initializationLi Nan
Prepare for making logical blocksize configurable. This change has no impact until logical block size becomes configurable. Move raid0_set_limits() before create_strip_zones(). It is safe as fields modified in create_strip_zones() do not involve mddev configuration, and rdev modifications there are not used in raid0_set_limits(). 'blksize' in create_strip_zones() fetches mddev's logical block size, which is already the maximum aross all rdevs, so the later max() can be removed. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-4-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: init bioset in mddev_initLi Nan
IO operations may be needed before md_run(), such as updating metadata after writing sysfs. Without bioset, this triggers a NULL pointer dereference as below: BUG: kernel NULL pointer dereference, address: 0000000000000020 Call Trace: md_update_sb+0x658/0xe00 new_level_store+0xc5/0x120 md_attr_store+0xc9/0x1e0 sysfs_kf_write+0x6f/0xa0 kernfs_fop_write_iter+0x141/0x2a0 vfs_write+0x1fc/0x5a0 ksys_write+0x79/0x180 __x64_sys_write+0x1d/0x30 x64_sys_call+0x2818/0x2880 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Reproducer ``` mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd] echo inactive > /sys/block/md0/md/array_state echo 10 > /sys/block/md0/md/new_level ``` mddev_init() can only be called once per mddev, no need to test if bioset has been initialized anymore. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com Fixes: d981ed841930 ("md: Add new_level sysfs interface") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: delete md_redundancy_group when array is becoming inactiveLi Nan
'md_redundancy_group' are created in md_run() and deleted in del_gendisk(), but these are not paired. Writing inactive/active to sysfs array_state can trigger md_run() multiple times without del_gendisk(), leading to duplicate creation as below: sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action' Call Trace: dump_stack_lvl+0x9f/0x120 dump_stack+0x14/0x20 sysfs_warn_dup+0x96/0xc0 sysfs_add_file_mode_ns+0x19c/0x1b0 internal_create_group+0x213/0x830 sysfs_create_group+0x17/0x20 md_run+0x856/0xe60 ? __x64_sys_openat+0x23/0x30 do_md_run+0x26/0x1d0 array_state_store+0x559/0x760 md_attr_store+0xc9/0x1e0 sysfs_kf_write+0x6f/0xa0 kernfs_fop_write_iter+0x141/0x2a0 vfs_write+0x1fc/0x5a0 ksys_write+0x79/0x180 __x64_sys_write+0x1d/0x30 x64_sys_call+0x2818/0x2880 do_syscall_64+0xa9/0x580 entry_SYSCALL_64_after_hwframe+0x4b/0x53 md: cannot register extra attributes for md0 Creation of it depends on 'pers', its lifecycle cannot be aligned with gendisk. So fix this issue by triggering 'md_redundancy_group' deletion when the array is becoming inactive. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11md: prevent adding disks with larger logical_block_size to active arraysLi Nan
When adding a disk to a md array, avoid updating the array's logical_block_size to match the new disk. This prevents accidental partition table loss that renders the array unusable. The later patch will introduce a way to configure the array's logical_block_size. The issue was introduced before Linux 2.6.12-rc2. Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/ Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md/raid5: remove redundant __GFP_NOWARNHuiwen He
The __GFP_NOWARN flag was included in GFP_NOWAIT since commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT"). So remove the redundant __GFP_NOWARN flag. Link: https://lore.kernel.org/linux-raid/20251102152540.871568-1-hehuiwen@kylinos.cn Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: avoid repeated calls to del_gendiskXiao Ni
There is a uaf problem which is found by case 23rdev-lifetime: Oops: general protection fault, probably for non-canonical address 0xdead000000000122 RIP: 0010:bdi_unregister+0x4b/0x170 Call Trace: <TASK> __del_gendisk+0x356/0x3e0 mddev_unlock+0x351/0x360 rdev_attr_store+0x217/0x280 kernfs_fop_write_iter+0x14a/0x210 vfs_write+0x29e/0x550 ksys_write+0x74/0xf0 do_syscall_64+0xbb/0x380 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff5250a177e The sequence is: 1. rdev remove path gets reconfig_mutex 2. rdev remove path release reconfig_mutex in mddev_unlock 3. md stop calls do_md_stop and sets MD_DELETED 4. rdev remove path calls del_gendisk because MD_DELETED is set 5. md stop path release reconfig_mutex and calls del_gendisk again So there is a race condition we should resolve. This patch adds a flag MD_DO_DELETE to avoid the race condition. Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Suggested-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md/md-llbitmap: Remove unneeded semicolonChen Ni
Remove unnecessary semicolons reported by Coccinelle/coccicheck and the semantic patch at scripts/coccinelle/misc/semicolon.cocci. Link: https://lore.kernel.org/linux-raid/20250910091912.25624-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md/md-linear: Enable atomic writesJohn Garry
All the infrastructure has already been plumbed to support this for stacked devices, so just enable the request_queue limits features flag. A note about chunk sectors for linear arrays: While it is possible to set a chunk sectors param for building a linear array, this is for specifying the granularity at which data sectors from the device are used. It is not the same as a stripe size, like for RAID0. As such, it is not appropriate to set chunk_sectors request queue limit to the same value, as chunk_sectors request limit is a boundary for which requests cannot straddle. However, request_queue limit max_hw_sectors is set to chunk sectors, which almost has the same effect as setting chunk_sectors limit. Link: https://lore.kernel.org/linux-raid/20250903161052.3326176-1-john.g.garry@oracle.com Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Yu Kuai <yukuai3@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08Factor out code into md_should_do_recovery()Wu Guanghao
In md_check_recovery(), use new helper to make code cleaner. Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: fix rcu protection in md_wakeup_threadYun Zhou
We attempted to use RCU to protect the pointer 'thread', but directly passed the value when calling md_wakeup_thread(). This means that the RCU pointer has been acquired before rcu_read_lock(), which renders rcu_read_lock() ineffective and could lead to a use-after-free. Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com Fixes: 446931543982 ("md: protect md_thread with rcu") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08md: delete mddev kobj before deleting gendisk kobjXiao Ni
In sync del gendisk path, it deletes gendisk first and the directory /sys/block/md is removed. Then it releases mddev kobj in a delayed work. If we enable debug log in sysfs_remove_group, we can see the debug log 'sysfs group bitmap not found for kobject md'. It's the reason that the parent kobj has been deleted, so it can't find parent directory. In creating path, it allocs gendisk first, then adds mddev kobj. So it should delete mddev kobj before deleting gendisk. Before commit 9e59d609763f ("md: call del_gendisk in control path"), it releases mddev kobj first. If the kobj hasn't been deleted, it does clean job and deletes the kobj. Then it calls del_gendisk and releases gendisk kobj. So it doesn't need to call kobject_del to delete mddev kobj. After this patch, in sync del gendisk path, the sequence changes. So it needs to call kobject_del to delete mddev kobj. After this patch, the sequence is: 1. kobject del mddev kobj 2. del_gendisk deletes gendisk kobj 3. mddev_delayed_delete releases mddev kobj 4. md_kobj_release releases gendisk kobj Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com Fixes: 9e59d609763f ("md: call del_gendisk in control path") Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-05block: introduce disk_report_zone()Damien Le Moal
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer alignment") introduced an indirect call for the callback function of a report zones executed with blkdev_report_zones(). This is necessary so that the function disk_zone_wplug_sync_wp_offset() can be called to refresh a zone write plug zone write pointer offset after a write error. However, this solution makes following the path of a zone information harder to understand. Clean this up by introducing the new blk_report_zones_args structure to define a zone report callback and its private data and introduce the helper function disk_report_zone() which calls both disk_zone_wplug_sync_wp_offset() and the zone report user callback function for all zones of a zone report. This helper function must be called by all block device drivers that implement the report zones block operation in order to correctly report a zone information. All block device drivers supporting the report_zones block operation are updated to use this new scheme. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-24treewide: Remove in_irq()Matthew Wilcox (Oracle)
This old alias for in_hardirq() has been marked as deprecated since 2020; remove the stragglers. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024180654.1691095-1-willy@infradead.org
2025-10-03Merge tag 'for-6.18/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - a new dm-pcache target for read/write caching on persistent memory - fix typos in docs - misc small refactoring - mark dm-error with DM_TARGET_PASSES_INTEGRITY - dm-request-based: fix NULL pointer dereference and quiesce_depth out of sync - dm-linear: optimize REQ_PREFLUSH - dm-vdo: return error on corrupted metadata - dm-integrity: support asynchronous hash interface * tag 'for-6.18/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits) dm raid: use proper md_ro_state enumerators dm-integrity: prefer synchronous hash interface dm-integrity: enable asynchronous hash interface dm-integrity: rename internal_hash dm-integrity: add the "offset" argument dm-integrity: allocate the recalculate buffer with kmalloc dm-integrity: introduce integrity_kmap and integrity_kunmap dm-integrity: replace bvec_kmap_local with kmap_local_page dm-integrity: use internal variable for digestsize dm vdo: return error on corrupted metadata in start_restoring_volume functions dm vdo: Update code to use mem_is_zero dm: optimize REQ_PREFLUSH with data when using the linear target dm-pcache: use int type to store negative error codes dm: fix "writen"->"written" dm-pcache: cleanup: fix coding style report by checkpatch.pl dm-pcache: remove ctrl_lock for pcache_cache_segment dm: fix NULL pointer dereference in __dm_suspend() dm: fix queue start/stop imbalance under suspend/load/resume races dm-pcache: add persistent cache target in device-mapper dm error: mark as DM_TARGET_PASSES_INTEGRITY ...
2025-10-02Merge tag 'for-6.18/block-20250929' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - FC target fixes (Daniel) - Authentication fixes and updates (Martin, Chris) - Admin controller handling (Kamaljit) - Target lockdep assertions (Max) - Keep-alive updates for discovery (Alastair) - Suspend quirk (Georg) - MD pull request via Yu: - Add support for a lockless bitmap. A key feature for the new bitmap are that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update bitmap bit, no additional overhead for the following writes. By supporting only resync or recover written data, means in the case creating new array or replacing with a new disk, there is no need to do a full disk resync/recovery. - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device. - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic. - floppy arch cleanups - Reduce the number of dereferencing needed for ublk commands - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot, by using nonsensical socket setups. - A few s390 dasd block fixes - Fix a few issues around atomic writes - Improve DMA interation for integrity requests - Improve how iovecs are treated with regards to O_DIRECT aligment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to. - Clean up and improve p2p support, enabling use of p2p for metadata payloads - Improve locking of request lookup, using SRCU where appropriate - Use page references properly for brd, avoiding very long RCU sections - Fix ordering of recursively submitted IOs - Clean up and improve updating nr_requests for a live device - Various fixes and cleanups * tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits) s390/dasd: enforce dma_alignment to ensure proper buffer validation s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request ublk: remove redundant zone op check in ublk_setup_iod() nvme: Use non zero KATO for persistent discovery connections nvmet: add safety check for subsys lock nvme-core: use nvme_is_io_ctrl() for I/O controller check nvme-core: do ioccsz/iorcsz validation only for I/O controllers nvme-core: add method to check for an I/O controller blk-cgroup: fix possible deadlock while configuring policy blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path blk-mq: Fix more tag iteration function documentation selftests: ublk: fix behavior when fio is not installed ublk: don't access ublk_queue in ublk_unmap_io() ublk: pass ublk_io to __ublk_complete_rq() ublk: don't access ublk_queue in ublk_need_complete_req() ublk: don't access ublk_queue in ublk_check_commit_and_fetch() ublk: don't pass ublk_queue to ublk_fetch() ublk: don't access ublk_queue in ublk_config_io_buf() ublk: don't access ublk_queue in ublk_check_fetch_buf() ublk: pass q_id and tag to __ublk_check_and_get_req() ...
2025-09-29Merge tag 'dlm-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This adds a dlm_release_lockspace() flag to request that node-failure recovery be performed for the node leaving the lockspace. The implementation of this flag requires coordination with userland clustering components. It's been requested for use by GFS2" * tag 'dlm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: check for undefined release_option values dlm: handle release_option as unsigned dlm: move to rinfo for all middle conversion cases dlm: handle invalid lockspace member remove dlm: add new flag DLM_RELEASE_RECOVER for dlm_lockspace_release dlm: add new configfs entry release_recover for lockspace members dlm: add new RELEASE_RECOVER uevent attribute for release_lockspace dlm: use defines for force values in dlm_release_lockspace dlm: check for defined force value in dlm_lockspace_release
2025-09-23dm raid: use proper md_ro_state enumeratorsHeinz Mauelshagen
The dm-raid code was using hardcoded integer values to represent the read-only/read-write state of RAID arrays instead of the proper enumeration constants defined in the md_ro_state enumerator type. Changes: - Replace hardcoded integers with the appropriate md_ro_state enumerator values - Add the missing MD_RDONLY setting in the post_suspend function (no failures have been attributed to this inconsistency, the fix ensures correct state transitions for completeness) This improves code clarity and maintainability by using the defined enumeration constants rather than magic numbers, ensuring the code properly conforms to the established API interface. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: prefer synchronous hash interfaceMikulas Patocka
The previous patch preferred async interface for the purpose of testing. However, the synchronous interface is faster, so it should be preferred. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: enable asynchronous hash interfaceMikulas Patocka
This commit enables the asynchronous hash interface in dm-integrity. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: rename internal_hashMikulas Patocka
Rename "internal_hash" to "internal_shash" and introduce a boolean value "internal_hash". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: add the "offset" argumentMikulas Patocka
Make sure that the "data" argument passed to integrity_sector_checksum is always page-aligned and add an "offset" argument that specifies the offset from the start of the page. This will enable us to use the asynchronous hash interface later. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: allocate the recalculate buffer with kmallocMikulas Patocka
Allocate the recalculate buffer with kmalloc rather than vmalloc. This will be needed later, for the simplification of the asynchronous hash interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: introduce integrity_kmap and integrity_kunmapMikulas Patocka
This abstraction will be used later, for the asynchronous hash interface. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: replace bvec_kmap_local with kmap_local_pageMikulas Patocka
Replace bvec_kmap_local with kmap_local_page - it will be needed for the upcoming patches that make kmap_local_page optional, depending on whether asynchronous hash interface is used or not. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm-integrity: use internal variable for digestsizeMikulas Patocka
Instead of calling digestsize() each time the digestsize for the internal hash is needed, store the digestsize in a new field internal_hash_digestsize within struct dm_integrity_c once and use this value when needed. Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm vdo: return error on corrupted metadata in start_restoring_volume functionsIvan Abramov
The return values of VDO_ASSERT calls that validate metadata are not acted upon. Return UDS_CORRUPT_DATA in case of an error. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: a4eb7e255517 ("dm vdo: implement the volume index") Signed-off-by: Ivan Abramov <i.abramov@mt-integration.ru> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-23dm vdo: Update code to use mem_is_zeroBruce Johnston
Remove function that would check if data was all zeroes. Use the built-in kernel function mem_is_zero() instead. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-09-19Merge tag 'block-6.17-20250918' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: "A set of fixes for an issue with md array assembly and drbd for devices supporting write zeros" * tag 'block-6.17-20250918' of git://git.kernel.dk/linux: drbd: init queue_limits->max_hw_wzeroes_unmap_sectors parameter md: init queue_limits->max_hw_wzeroes_unmap_sectors parameter