path: root/drivers/md/md.c
Age  Commit message  Author
2025-11-30  md: remove legacy 1s delay in md_notify_reboot  (Tarun Sahu)
During system shutdown, the md driver's registered notifier function (md_notify_reboot) currently imposes a hardcoded one-second delay. The delay was introduced approximately 23 years ago and was likely necessary for the hardware generation of that time. Remove it on the assumption that no known devices still need the delay.
Link: https://lore.kernel.org/linux-raid/20251121191422.2758555-1-tarunsahu@google.com
Signed-off-by: Tarun Sahu <tarunsahu@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
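A minimal sketch of the notifier shape described above, with the removed legacy delay shown only as a comment; the names here are illustrative and this is not the actual md_notify_reboot() body:
```
#include <linux/reboot.h>
#include <linux/notifier.h>

static int example_md_notify_reboot(struct notifier_block *nb,
				    unsigned long code, void *data)
{
	/* ... switch arrays to read-only / flush pending writes ... */

	/*
	 * Legacy behaviour removed by the patch above:
	 *	mdelay(1000);	// give old hardware time to settle
	 */
	return NOTIFY_DONE;
}

static struct notifier_block example_md_notifier = {
	.notifier_call = example_md_notify_reboot,
};
/* registered during init with register_reboot_notifier(&example_md_notifier) */
```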
2025-11-30  md: warn about updating super block failure  (Yu Kuai)
Many personalities handle IO errors from their daemon thread (like raid1d, raid10d, raid5d), and the superblock must be written out clean before these failed IOs can be handled. However, updating the superblock can fail, for example when the array is broken by IO failures or when the user changes array_state through the sysfs api. This patch adds a warning when updating the superblock fails first, in case such a failure is related to an IO hang.
Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-11-11  md: allow configuring logical block size  (Li Nan)
Previously, a raid array used the maximum logical block size (LBS) of all member disks. Adding a disk with a larger LBS at runtime could therefore unexpectedly increase the array's LBS, risking corruption of existing partitions. This can be reproduced by:
```
# LBS of sd[de] is 512 bytes, sdf is 4096 bytes.
mdadm -CRq /dev/md0 -l1 -n3 /dev/sd[de] missing --assume-clean

# LBS is 512
cat /sys/block/md0/queue/logical_block_size

# create partition md0p1
parted -s /dev/md0 mklabel gpt mkpart primary 1MiB 100%
lsblk | grep md0p1

# LBS becomes 4096 after adding sdf
mdadm --add -q /dev/md0 /dev/sdf
cat /sys/block/md0/queue/logical_block_size

# partition lost
partprobe /dev/md0
lsblk | grep md0p1
```
Simply rejecting disks with a larger LBS is inflexible. In some scenarios only 512-byte-LBS disks are available at creation time, but disks with 4KB LBS may be added to the array later. Making the LBS configurable is the best way to handle this scenario.
After this patch, the raid will:
 - store the LBS in the disk metadata
 - add a read-write sysfs attribute 'mdX/logical_block_size'
Future mdadm should support setting the LBS via the metadata field during RAID creation and via the new sysfs attribute. Although the kernel allows changing the LBS at runtime, users should avoid modifying it after creating partitions or filesystems to prevent compatibility issues.
Only 1.x metadata supports a configurable LBS; 0.90 metadata initializes all fields to default values at auto-detect. Supporting 0.90 would require more extensive changes and no such use case has been observed.
Note that many RAID paths rely on PAGE_SIZE alignment, including metadata IO. An LBS larger than PAGE_SIZE would result in metadata read/write failures, so this configuration must be prevented.
Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-6-linan666@huaweicloud.com
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
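A sketch of the validation the entry above implies (power of two, at least one sector, at most PAGE_SIZE); the helper name is hypothetical and the exact limits in the merged code may differ:
```
#include <linux/log2.h>
#include <linux/blkdev.h>

/* Reject anything that is not a power of two between 512 bytes and PAGE_SIZE. */
static int example_validate_md_lbs(unsigned int lbs)
{
	if (lbs < SECTOR_SIZE || lbs > PAGE_SIZE || !is_power_of_2(lbs))
		return -EINVAL;
	return 0;
}
```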
2025-11-11  md: add check_new_feature module parameter  (Li Nan)
Raid checks if pad3 is zero when loading superblock from disk. Arrays created with new features may fail to assemble on old kernels as pad3 is used. Add module parameter check_new_feature to bypass this check. Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-5-linan666@huaweicloud.com Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
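A sketch of how such a module parameter can gate the pad3 check. The parameter name follows the patch description, but the default value, polarity, and the superblock field access are assumptions for illustration only:
```
#include <linux/module.h>
#include <linux/types.h>

static bool check_new_feature;
module_param(check_new_feature, bool, 0644);
MODULE_PARM_DESC(check_new_feature,
		 "accept superblocks whose reserved (pad3) fields are in use");

/* Return true if the superblock passes the pad3 check (or the check is bypassed). */
static bool example_sb_pad3_ok(const __le32 *pad3, int nr_words)
{
	int i;

	if (check_new_feature)
		return true;	/* user asked to bypass the strict check */

	for (i = 0; i < nr_words; i++)
		if (pad3[i])
			return false;
	return true;
}
```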
2025-11-11  md: init bioset in mddev_init  (Li Nan)
IO operations may be needed before md_run(), such as updating metadata after writing sysfs. Without the bioset, this triggers a NULL pointer dereference as below:
BUG: kernel NULL pointer dereference, address: 0000000000000020
Call Trace:
 md_update_sb+0x658/0xe00
 new_level_store+0xc5/0x120
 md_attr_store+0xc9/0x1e0
 sysfs_kf_write+0x6f/0xa0
 kernfs_fop_write_iter+0x141/0x2a0
 vfs_write+0x1fc/0x5a0
 ksys_write+0x79/0x180
 __x64_sys_write+0x1d/0x30
 x64_sys_call+0x2818/0x2880
 do_syscall_64+0xa9/0x580
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
Reproducer:
```
mdadm -CR /dev/md0 -l1 -n2 /dev/sd[cd]
echo inactive > /sys/block/md0/md/array_state
echo 10 > /sys/block/md0/md/new_level
```
mddev_init() can only be called once per mddev, so there is no need to test whether the bioset has already been initialized anymore.
Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-3-linan666@huaweicloud.com
Fixes: d981ed841930 ("md: Add new_level sysfs interface")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
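A rough illustration of initializing the mddev biosets early (conceptually inside mddev_init()); it assumes the md driver's internal md.h for struct mddev, and real error handling in the merged code may differ:
```
#include <linux/bio.h>
#include "md.h"

static int example_mddev_init_biosets(struct mddev *mddev)
{
	int err;

	err = bioset_init(&mddev->bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
	if (err)
		return err;

	err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, 0);
	if (err)
		bioset_exit(&mddev->bio_set);
	return err;
}
```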
2025-11-11  md: delete md_redundancy_group when array is becoming inactive  (Li Nan)
'md_redundancy_group' is created in md_run() and deleted in del_gendisk(), but the two are not paired. Writing inactive/active to the sysfs attribute array_state can trigger md_run() multiple times without del_gendisk(), leading to duplicate creation as below:
sysfs: cannot create duplicate filename '/devices/virtual/block/md0/md/sync_action'
Call Trace:
 dump_stack_lvl+0x9f/0x120
 dump_stack+0x14/0x20
 sysfs_warn_dup+0x96/0xc0
 sysfs_add_file_mode_ns+0x19c/0x1b0
 internal_create_group+0x213/0x830
 sysfs_create_group+0x17/0x20
 md_run+0x856/0xe60
 ? __x64_sys_openat+0x23/0x30
 do_md_run+0x26/0x1d0
 array_state_store+0x559/0x760
 md_attr_store+0xc9/0x1e0
 sysfs_kf_write+0x6f/0xa0
 kernfs_fop_write_iter+0x141/0x2a0
 vfs_write+0x1fc/0x5a0
 ksys_write+0x79/0x180
 __x64_sys_write+0x1d/0x30
 x64_sys_call+0x2818/0x2880
 do_syscall_64+0xa9/0x580
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
md: cannot register extra attributes for md0
Creation of the group depends on 'pers', so its lifecycle cannot be aligned with the gendisk. Fix this issue by deleting 'md_redundancy_group' when the array is becoming inactive.
Link: https://lore.kernel.org/linux-raid/20251103125757.1405796-2-linan666@huaweicloud.com
Fixes: 790abe4d77af ("md: remove/add redundancy group only in level change")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-11  md: prevent adding disks with larger logical_block_size to active arrays  (Li Nan)
When adding a disk to an md array, avoid updating the array's logical_block_size to match the new disk. This prevents accidental partition table loss that would render the array unusable. A later patch will introduce a way to configure the array's logical_block_size. The issue was introduced before Linux 2.6.12-rc2.
Link: https://lore.kernel.org/linux-raid/20250918115759.334067-2-linan666@huaweicloud.com/
Fixes: d2e45eace8 ("[PATCH] Fix raid "bio too big" failures")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08  md: avoid repeated calls to del_gendisk  (Xiao Ni)
There is a UAF problem found by test case 23rdev-lifetime:
Oops: general protection fault, probably for non-canonical address 0xdead000000000122
RIP: 0010:bdi_unregister+0x4b/0x170
Call Trace:
 <TASK>
 __del_gendisk+0x356/0x3e0
 mddev_unlock+0x351/0x360
 rdev_attr_store+0x217/0x280
 kernfs_fop_write_iter+0x14a/0x210
 vfs_write+0x29e/0x550
 ksys_write+0x74/0xf0
 do_syscall_64+0xbb/0x380
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff5250a177e
The sequence is:
1. the rdev remove path takes reconfig_mutex
2. the rdev remove path releases reconfig_mutex in mddev_unlock
3. md stop calls do_md_stop and sets MD_DELETED
4. the rdev remove path calls del_gendisk because MD_DELETED is set
5. the md stop path releases reconfig_mutex and calls del_gendisk again
So there is a race condition that needs to be resolved. This patch adds a flag MD_DO_DELETE to avoid the race condition.
Link: https://lore.kernel.org/linux-raid/20251029063419.21700-1-xni@redhat.com
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Suggested-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
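A sketch of the idea behind such a flag: whichever path reaches deletion first claims it atomically, so del_gendisk() runs exactly once. The flag and helper names mirror the description above but are illustrative, not the merged code:
```
#include "md.h"

static void example_md_del_gendisk_once(struct mddev *mddev)
{
	/* whichever path gets here first claims the deletion atomically */
	if (test_and_set_bit(MD_DO_DELETE, &mddev->flags))
		return;		/* the other path already owns del_gendisk() */

	del_gendisk(mddev->gendisk);
}
```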
2025-11-08  Factor out code into md_should_do_recovery()  (Wu Guanghao)
In md_check_recovery(), use the new helper to make the code cleaner.
Link: https://lore.kernel.org/linux-raid/e62894c8-d916-94bc-ef48-3c60e6e1fc5d@huawei.com
Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@fnnas.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-11-08  md: fix rcu protection in md_wakeup_thread  (Yun Zhou)
We attempted to use RCU to protect the pointer 'thread', but directly passed the value when calling md_wakeup_thread(). This means that the RCU pointer has been acquired before rcu_read_lock(), which renders rcu_read_lock() ineffective and could lead to a use-after-free. Link: https://lore.kernel.org/linux-raid/20251015083227.1079009-1-yun.zhou@windriver.com Fixes: 446931543982 ("md: protect md_thread with rcu") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
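The point of the fix is that the RCU-protected pointer must be dereferenced inside the rcu_read_lock() section rather than loaded by the caller beforehand. A simplified sketch of that pattern (not the exact upstream body):
```
#include <linux/rcupdate.h>
#include "md.h"

static void example_md_wakeup_thread(struct md_thread __rcu *threadp)
{
	struct md_thread *t;

	rcu_read_lock();
	t = rcu_dereference(threadp);	/* dereference *inside* the RCU section */
	if (t) {
		set_bit(THREAD_WAKEUP, &t->flags);
		wake_up(&t->wqueue);
	}
	rcu_read_unlock();
}
```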
2025-11-08  md: delete mddev kobj before deleting gendisk kobj  (Xiao Ni)
In the sync del_gendisk path, the gendisk is deleted first and the directory /sys/block/md is removed; the mddev kobj is then released in a delayed work. With the debug log in sysfs_remove_group enabled, the message 'sysfs group bitmap not found for kobject md' can be seen. The reason is that the parent kobj has already been deleted, so the parent directory cannot be found.
In the creation path, the gendisk is allocated first and then the mddev kobj is added, so teardown should delete the mddev kobj before deleting the gendisk.
Before commit 9e59d609763f ("md: call del_gendisk in control path"), the mddev kobj was released first. If the kobj had not been deleted yet, the release did the cleanup and deleted the kobj; del_gendisk was then called and released the gendisk kobj, so there was no need to call kobject_del for the mddev kobj. After this patch, the sequence in the sync del_gendisk path changes, so kobject_del must be called to delete the mddev kobj. The sequence is now:
1. kobject_del deletes the mddev kobj
2. del_gendisk deletes the gendisk kobj
3. mddev_delayed_delete releases the mddev kobj
4. md_kobj_release releases the gendisk kobj
Link: https://lore.kernel.org/linux-raid/20250928012424.61370-1-xni@redhat.com
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2025-10-02  Merge tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux  (Linus Torvalds)
Pull block updates from Jens Axboe:
 - NVMe pull request via Keith:
     - FC target fixes (Daniel)
     - Authentication fixes and updates (Martin, Chris)
     - Admin controller handling (Kamaljit)
     - Target lockdep assertions (Max)
     - Keep-alive updates for discovery (Alastair)
     - Suspend quirk (Georg)
 - MD pull request via Yu:
     - Add support for a lockless bitmap. A key feature of the new bitmap is that the IO fastpath is lockless. If a user issues lots of write IO to the same bitmap bit in a short time, only the first write has additional overhead to update the bitmap bit; there is no additional overhead for the following writes. By only resyncing or recovering written data, creating a new array or replacing a disk no longer requires a full disk resync/recovery.
 - Switch ->getgeo() and ->bios_param() to using struct gendisk rather than struct block_device.
 - Rust block changes via Andreas. This series adds configuration via configfs and remote completion to the rnull driver. The series also includes a set of changes to the rust block device driver API: a few cleanup patches, and a few features supporting the rnull changes. The series removes the raw buffer formatting logic from `kernel::block` and improves the logic available in `kernel::string` to support the same use as the removed logic.
 - floppy arch cleanups
 - Reduce the number of dereferences needed for ublk commands
 - Restrict supported sockets for nbd. Mostly done to eliminate a class of issues perpetually reported by syzbot using nonsensical socket setups.
 - A few s390 dasd block fixes
 - Fix a few issues around atomic writes
 - Improve DMA interaction for integrity requests
 - Improve how iovecs are treated with regards to O_DIRECT alignment constraints. We used to require each segment to adhere to the constraints, now only the request as a whole needs to.
 - Clean up and improve p2p support, enabling use of p2p for metadata payloads
 - Improve locking of request lookup, using SRCU where appropriate
 - Use page references properly for brd, avoiding very long RCU sections
 - Fix ordering of recursively submitted IOs
 - Clean up and improve updating nr_requests for a live device
 - Various fixes and cleanups
* tag 'for-6.18/block-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (164 commits)
  s390/dasd: enforce dma_alignment to ensure proper buffer validation
  s390/dasd: Return BLK_STS_INVAL for EINVAL from do_dasd_request
  ublk: remove redundant zone op check in ublk_setup_iod()
  nvme: Use non zero KATO for persistent discovery connections
  nvmet: add safety check for subsys lock
  nvme-core: use nvme_is_io_ctrl() for I/O controller check
  nvme-core: do ioccsz/iorcsz validation only for I/O controllers
  nvme-core: add method to check for an I/O controller
  blk-cgroup: fix possible deadlock while configuring policy
  blk-mq: fix null-ptr-deref in blk_mq_free_tags() from error path
  blk-mq: Fix more tag iteration function documentation
  selftests: ublk: fix behavior when fio is not installed
  ublk: don't access ublk_queue in ublk_unmap_io()
  ublk: pass ublk_io to __ublk_complete_rq()
  ublk: don't access ublk_queue in ublk_need_complete_req()
  ublk: don't access ublk_queue in ublk_check_commit_and_fetch()
  ublk: don't pass ublk_queue to ublk_fetch()
  ublk: don't access ublk_queue in ublk_config_io_buf()
  ublk: don't access ublk_queue in ublk_check_fetch_buf()
  ublk: pass q_id and tag to __ublk_check_and_get_req()
  ...
2025-09-06  md/md-llbitmap: introduce new lockless bitmap  (Yu Kuai)
Redundant data is used to enhance data fault tolerance, and the way redundant data is stored varies depending on the RAID level; it is important to keep the redundant data consistent. The bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array; when a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after a power failure or after re-adding a disk. If there is no bitmap, a full disk synchronization is required.
Due to known performance issues with md-bitmap and its unreasonable implementation details:
 - self-managed IO submission like filemap_write_page();
 - a global spin_lock;
I have decided not to continue optimizing based on the current bitmap implementation. This new bitmap is designed without locking in the IO fast path and can be used with fast disks. For the design and details, see the comments in drivers/md/md-llbitmap.c.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06  md/md-bitmap: make method bitmap_ops->daemon_work optional  (Yu Kuai)
daemon_work() is called from the daemon thread. On the one hand, the daemon thread has no strict wake-up time; on the other hand, too much work is put on the daemon thread, like handling sync IO, handling failed or special normal IO, handling recovery, and so on. Hence the daemon thread may be too busy to clear dirty bits in time. Make bitmap_ops->daemon_work() optional; following patches will use a separate async work to clear dirty bits for the new bitmap.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
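A sketch of the call-site guard that an optional method implies; the wrapper name is hypothetical, but the bitmap_ops->daemon_work() hook is the one named above:
```
#include "md.h"

static void example_bitmap_daemon_work(struct mddev *mddev)
{
	/* daemon_work is now optional: only call it if the registered ops provide it */
	if (mddev->bitmap_ops && mddev->bitmap_ops->daemon_work)
		mddev->bitmap_ops->daemon_work(mddev);
}
```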
2025-09-06  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER  (Yu Kuai)
This flag is used by llbitmap in later patches to skip the initial raid456 recovery and delay building the initial xor data until the first write.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-06  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations  (Yu Kuai)
This method is used to check if blocks can be skipped before calling into pers->sync_request(), llbitmap will use this method to skip resync for unwritten/clean data blocks, and recovery/check/repair for unwritten data blocks; Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06  md/md-bitmap: delay registration of bitmap_ops until creating bitmap  (Yu Kuai)
Currently bitmap_ops is registered while allocating mddev, this is fine when there is only one bitmap_ops. Delay setting bitmap_ops until creating bitmap, so that user can choose which bitmap to use before running the array. Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06  md/md-bitmap: add a new sysfs api bitmap_type  (Yu Kuai)
The api will be used by mdadm to set bitmap_type while creating new array or assembling array, prepare to add a new bitmap. Currently available options are: cat /sys/block/md0/md/bitmap_type none [bitmap] Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06  md: add a new mddev field 'bitmap_id'  (Yu Kuai)
Prepare to store the bitmap id selected by user, also refactor mddev_set_bitmap_ops a bit in case the value is invalid. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06  md/md-bitmap: support discard for bitmap ops  (Yu Kuai)
Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw' in struct md_io_clone to handle discard IO, preparing to support the new md bitmap. Since all bitmap functions that handle write IO are the same, also add a typedef to make the code cleaner.
Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06  md: factor out a helper raid_is_456()  (Yu Kuai)
There are no functional changes, the helper will be used by llbitmap in following patches. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
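A minimal sketch of what such a helper checks, assuming it looks at mddev->level; the upstream helper may instead inspect the personality's submodule id:
```
#include "md.h"

static bool example_raid_is_456(struct mddev *mddev)
{
	return mddev->level == 4 || mddev->level == 5 || mddev->level == 6;
}
```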
2025-09-06  md: add a new parameter 'offset' to md_super_write()  (Yu Kuai)
The parameter is always set to 0 for now, following patches will use this helper to write llbitmap to underlying disks, allow writing dirty sectors instead of the whole page. Also rename md_super_write to md_write_metadata since there is nothing super-block specific. Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com>
2025-09-06  md/md-bitmap: introduce CONFIG_MD_BITMAP  (Yu Kuai)
Now that all implementations are internal, it's sensible to add a config option for md-bitmap, and it's a good way for isolation. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-16-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06  md: check before referencing mddev->bitmap_ops  (Yu Kuai)
Prepare to introduce CONFIG_MD_BITMAP. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-15-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-06  md/md-bitmap: merge md_bitmap_group into bitmap_operations  (Yu Kuai)
Now that all bitmap implementations are internal, it doesn't make sense to export md_bitmap_group anymore. Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-09-05  md: prevent incorrect update of resync/recovery offset  (Li Nan)
In md_do_sync(), when md_sync_action returns ACTION_FROZEN, subsequent call to md_sync_position() will return MaxSector. This causes 'curr_resync' (and later 'recovery_offset') to be set to MaxSector too, which incorrectly signals that recovery/resync has completed, even though disk data has not actually been updated. To fix this issue, skip updating any offset values when the sync action is FROZEN. The same holds true for IDLE. Fixes: 7d9f107a4e94 ("md: use new helpers in md_do_sync()") Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250904073452.3408516-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-09-03  Merge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block  (Jens Axboe)
Pull struct block_device getgeo changes from Al.
 "switching ->getgeo() from struct block_device to struct gendisk
  Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>"
* tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block: switch ->getgeo() to struct gendisk
  scsi: switch ->bios_param() to passing gendisk
  scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk
2025-08-16  md: fix sync_action incorrect display during resync  (Zheng Qixing)
During raid resync, if a disk becomes faulty, the operation is briefly interrupted. The MD_RECOVERY_RECOVER flag triggered by the disk failure causes sync_action to incorrectly show "recover" instead of "resync". The same issue affects reshape operations.
Reproduction steps:
  mdadm -Cv /dev/md1 -l1 -n4 -e1.2 /dev/sd{a..d}   // -> resync happened
  mdadm -f /dev/md1 /dev/sda                        // -> resync interrupted
  cat sync_action                                   // -> recover
Add progress checks in md_sync_action() for resync/recover/reshape to ensure the interface correctly reports the actual operation type.
Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-3-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16  md: add helper rdev_needs_recovery()  (Zheng Qixing)
Add a helper for checking if an rdev needs recovery. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250816002534.1754356-2-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-16  md: keep recovery_cp in mdp_superblock_s  (Xiao Ni)
commit 907a99c314a5 ("md: rename recovery_cp to resync_offset") replaced recovery_cp with resync_offset in mdp_superblock_s, which is defined in md_p.h. md_p.h is used in userspace too, so building mdadm fails because of this. This patch reverts that change.
Fixes: 907a99c314a5 ("md: rename recovery_cp to resync_offset")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250815040028.18085-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-13  md: add legacy_async_del_gendisk mode  (Xiao Ni)
commit 9e59d609763f ("md: call del_gendisk in control path") changed del_gendisk from being called asynchronously to being called synchronously, but this breaks the mdadm --assemble command. The assemble command runs like this:
1. create the array
2. stop the array
3. access the sysfs files after stopping
The sync way calls del_gendisk in step 2, so all sysfs files are removed. To avoid breaking the mdadm assemble command, this patch adds the parameter legacy_async_del_gendisk that can be used to choose between the two ways; the default is the async way. In the future, we plan to change the default to the sync way in kernel 7.0. Users will then need to upgrade to mdadm 4.5+, which removes step 2.
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/linux-raid/CAMw=ZnQ=ET2St-+hnhsuq34rRPnebqcXqP1QqaHW5Bh4aaaZ4g@mail.gmail.com/T/#t
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20250813032929.54978-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-08-13  block: switch ->getgeo() to struct gendisk  (Al Viro)
Instances are happier that way and it makes more sense anyway - the only part of the result that is related to partition we are given is the start sector, and that has been filled in by the caller. Everything else is a function of the disk. Only one instance (DASD) is ever looking at anything other than bdev->bd_disk and that one is trivial to adjust. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-08-03  md: make rdev_addable usable for rcu mode  (Yang Erkun)
Our testcase triggers a panic:
BUG: kernel NULL pointer dereference, address: 00000000000000e0
...
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.16.0+ #94 PREEMPT(none)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
Workqueue: md_misc md_start_sync
RIP: 0010:rdev_addable+0x4d/0xf0
...
Call Trace:
 <TASK>
 md_start_sync+0x329/0x480
 process_one_work+0x226/0x6d0
 worker_thread+0x19e/0x340
 kthread+0x10f/0x250
 ret_from_fork+0x14d/0x180
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in: raid10
CR2: 00000000000000e0
---[ end trace 0000000000000000 ]---
RIP: 0010:rdev_addable+0x4d/0xf0
md_spares_need_change() in md_start_sync() calls rdev_addable(), which is protected by rcu_read_lock/rcu_read_unlock. The RCU context keeps the rdev from being released, but rdev->mddev is set to NULL before synchronize_rcu is called in md_kick_rdev_from_array. Fix this by using READ_ONCE and checking whether rdev->mddev is still alive.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Fixes: 570b9147deb6 ("md: use RCU lock to protect traversal in md_spares_need_change()")
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250731114530.776670-1-yangerkun@huawei.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
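A sketch of the pattern the fix describes: take a single READ_ONCE() snapshot of rdev->mddev and bail out if the rdev has already been kicked from the array. The function name and the trimmed body are illustrative:
```
#include <linux/compiler.h>
#include "md.h"

static bool example_rdev_addable(struct md_rdev *rdev)
{
	/* one snapshot; md_kick_rdev_from_array() may have cleared it already */
	struct mddev *mddev = READ_ONCE(rdev->mddev);

	if (!mddev)
		return false;

	/* ... the remaining checks all use the local 'mddev' snapshot ... */
	return true;
}
```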
2025-07-31  md: rename recovery_cp to resync_offset  (Li Nan)
'recovery_cp' was used to represent the progress of sync, but its name contains "recovery", which can cause confusion. Replace 'recovery_cp' with 'resync_offset' for clarity.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250722033340.1933388-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-31  md/md-cluster: handle REMOVE message earlier  (Heming Zhao)
Commit a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add) Consider a 2-node cluster: - node1 set faulty & remove command on a disk. - node2 must correctly update the array metadata. Before a1fd37f97808, on node1, the delay between msg:METADATA_UPDATED (triggered by faulty) and msg:REMOVE was sufficient for node2 to reload the disk info (written by node1). After a1fd37f97808, node1 no longer waits between faulty and remove, causing it to send msg:REMOVE while node2 is still reloading disk info. This often results in node2 failing to remove the faulty disk. == how to trigger == set up a 2-node cluster (node1 & node2) with disks vdc & vdd. on node1: mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc check array status on both nodes with "mdadm -D /dev/md0". node1 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd node2 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd 0 254 32 - faulty /dev/vdc Fixes: a1fd37f97808 ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/linux-raid/20250728042145.9989-1-heming.zhao@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-31  md: fix create on open mddev lifetime regression  (Yu Kuai)
Commit 9e59d609763f ("md: call del_gendisk in control path") moves setting MD_DELETED from __mddev_put() to do_md_stop(), however, for the case create on open, mddev can be freed without do_md_stop(): 1) open md_probe md_alloc_and_put md_alloc mddev_alloc atomic_set(&mddev->active, 1); mddev->hold_active = UNTIL_IOCTL mddev_put atomic_dec_and_test(&mddev->active) if (mddev->hold_active) -> active is 0, hold_active is set md_open mddev_get atomic_inc(&mddev->active); 2) ioctl that is not STOP_ARRAY, for example, GET_ARRAY_INFO: md_ioctl mddev->hold_active = 0 3) close md_release mddev_put(mddev); atomic_dec_and_lock(&mddev->active, &all_mddevs_lock) __mddev_put -> hold_active is cleared, mddev will be freed queue_work(md_misc_wq, &mddev->del_work) Now that MD_DELETED is not set, before mddev is freed by mddev_delayed_delete(), md_open can still succeed and break mddev lifetime, causing mddev->kobj refcount underflow or mddev uaf problem. Fix this problem by setting MD_DELETED before queuing del_work. Reported-by: syzbot+9921e319bd6168140b40@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0012.GAE@google.com/ Reported-by: syzbot+fa3a12519f0d3fd4ec16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0013.GAE@google.com/ Fixes: 9e59d609763f ("md: call del_gendisk in control path") Link: https://lore.kernel.org/linux-raid/20250730073321.2583158-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-07-12  md: allow removing faulty rdev during resync  (Zheng Qixing)
During RAID resync, faulty rdev cannot be removed and will result in "Device or resource busy" error when attempting hot removal. Reproduction steps: mdadm -Cv /dev/md0 -l1 -n3 -e1.2 /dev/sd{b..d} mdadm /dev/md0 -f /dev/sdb mdadm /dev/md0 -r /dev/sdb -> mdadm: hot remove failed for /dev/sdb: Device or resource busy After commit 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery"), when a device becomes faulty during resync, the md_choose_sync_action() function returns early without calling remove_and_add_spares(), preventing faulty device removal. This patch extracts a helper function remove_spares() to support removing faulty devices during RAID resync operations. Fixes: 4b10a3bc67c1 ("md: ensure resync is prioritized over recovery") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250707075412.150301-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-12  md: remove/add redundancy group only in level change  (Xiao Ni)
del_gendisk is called in synchronous way now. So it doesn't need to handle redundancy group in stop path separately. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-4-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-12  md: Don't clear MD_CLOSING until mddev is freed  (Xiao Ni)
UNTIL_STOP is used to avoid the mddev being freed on the last close before disks are added to it, and it should be cleared when stopping an array, as mentioned in commit efeb53c0e572 ("md: Allow md devices to be created by name."). So reset ->hold_active to 0 in md_clean. And MD_CLOSING should be kept until the mddev is freed to avoid reopening.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250611073108.25463-3-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-07-12  md: call del_gendisk in control path  (Xiao Ni)
Currently del_gendisk and put_disk are called asynchronously from workqueue work. The asynchronous way has the problem that the device node can still exist for a short window after the mdadm --stop command returns, so a udev rule can open this device node and create the struct mddev in the kernel again. So call del_gendisk in the control path and keep put_disk in md_kobj_release to avoid a UAF of the gendisk.
del_gendisk can't be called with reconfig_mutex held; if it were, a deadlock could happen: del_gendisk waits for all sysfs file accesses to finish, and sysfs file access waits for reconfig_mutex. So call del_gendisk after releasing reconfig_mutex.
But there is still a window in which sysfs can be accessed between mddev_unlock and del_gendisk, so some actions (add disk, change level, e.g.) can happen and lead to unexpected results. MD_DELETED is used to resolve this problem: it is set before releasing reconfig_mutex and must be checked by the sysfs accesses that need reconfig_mutex. For sysfs accesses that don't need reconfig_mutex, del_gendisk will wait for them to finish. This check is not needed in mddev_lock_nointr; there are ten places that call it:
 * Five of them are in dm raid, which we don't need to care about. MD_DELETED is only used for md raid.
 * stop_sync_thread, md_do_sync and md_start_sync are related to sync requests, and stopping an array needs to wait for the sync thread to finish.
 * md_ioctl: md_open is called before md_ioctl, so ->openers is elevated and stopping the array will fail, so MD_DELETED doesn't need to be checked here.
 * md_set_readonly: it needs to call mddev_set_closing_and_sync_blockdev when setting readonly or read_auto, so stopping the array will fail too because MD_CLOSING is already set.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20250611073108.25463-2-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-06-08  treewide, timers: Rename from_timer() to timer_container_of()  (Ingo Molnar)
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-30  md/md-bitmap: remove parameter slot from bitmap_create()  (Yu Kuai)
All callers pass in '-1' for 'slot', hence it can be removed.
Link: https://lore.kernel.org/linux-raid/20250524061320.370630-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30  md/md-bitmap: cleanup bitmap_ops->startwrite()  (Yu Kuai)
bitmap_startwrite() always return 0, and the caller doesn't check return value as well, hence change the method to void. Also rename startwrite/endwrite to start_write/end_write, which is more in line with the usual naming convention. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-10  md: fix is_mddev_idle()  (Yu Kuai)
If sync_speed is above speed_min, then is_mddev_idle() is called for each sync IO to check whether the array is idle, and inflight sync IO is limited if the array is not idle. However, when running mkfs.ext4 on a large raid5 array while recovery is in progress, it was found that sync_speed stayed above speed_min while lots of stripes were used for sync IO, causing a long delay for mkfs.ext4.
The root cause is the following check in is_mddev_idle():
t1: submit sync IO:      events1 = completed IO - issued sync IO
t2: submit next sync IO: events2 = completed IO - issued sync IO
if (events2 - events1 > 64)
As a consequence, the more sync IO is issued, the less likely the check is to pass. And once completed normal IO exceeds issued sync IO, the condition finally passes and is_mddev_idle() returns false; however, last_events is updated, hence is_mddev_idle() can only return false once in a while.
Fix this problem by changing the check to the following:
1) the mddev has no completed normal IO;
2) the mddev has no inflight normal IO;
3) if any member disk is a partition, none of the other partitions on that disk has completed IO.
Also change rdev->last_events to unsigned long to clean up type casting.
Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
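A compilable sketch of the three idleness criteria listed above; the merged code derives the three facts from block-layer statistics, which is elided here, so treat this purely as a restatement of the conditions:
```
#include "md.h"

static bool example_mddev_is_idle(struct mddev *mddev)
{
	bool completed_normal_io = false;	/* 1) gathered from gendisk/part stats */
	bool inflight_normal_io = false;	/* 2) gathered from gendisk/part stats */
	bool sibling_parts_busy = false;	/* 3) only relevant for partition members */

	/* the real implementation fills these in from block-layer accounting */
	return !completed_normal_io && !inflight_normal_io && !sibling_parts_busy;
}
```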
2025-05-10  md: add a new api sync_io_depth  (Yu Kuai)
Currently if sync speed is above speed_min and below speed_max, md_do_sync() will wait for all sync IOs to be done before issuing new sync IO, means sync IO depth is limited to just 1. This limit is too low, in order to prevent sync speed drop conspicuously after fixing is_mddev_idle() in the next patch, add a new api for limiting sync IO depth, the default value is 32. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-04-05  treewide: Switch/rename to timer_delete[_sync]()  (Thomas Gleixner)
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-13  Merge tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.15/block  (Jens Axboe)
Merge MD changes from Yu:
 "- fix recovery can preempt resync (Li Nan)
  - fix md-bitmap IO limit (Su Yue)
  - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
  - fix raid1 memory leak (Zheng Qixing)
  - fix mddev uaf (Yu Kuai)
  - fix raid1,raid10 IO flags (Yu Kuai)
  - some refactor and cleanup (Yu Kuai)"
* tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid10: wait barrier before returning discard request with REQ_NOWAIT
  md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb
  md/raid1,raid10: don't ignore IO flags
  md/raid5: merge reshape_progress checking inside get_reshape_loc()
  md: fix mddev uaf while iterating all_mddevs list
  md: switch md-cluster to use md_submodle_head
  md: don't export md_cluster_ops
  md/md-cluster: cleanup md_cluster_ops reference
  md: switch personalities to use md_submodule_head
  md: introduce struct md_submodule_head and APIs
  md: only include md-cluster.h if necessary
  md: merge common code into find_pers()
  md/raid1: fix memory leak in raid1_run() if no active rdev
  md: ensure resync is prioritized over recovery
2025-03-06  md: improve return types of badblocks handling functions  (Zheng Qixing)
rdev_set_badblocks() only indicates success/failure, so convert its return type from int to boolean for better semantic clarity. rdev_clear_badblocks() return value is never used by any caller, convert it to void. This removes unnecessary value returns. Also update narrow_write_error() in both raid1 and raid10 to use boolean return type to match rdev_set_badblocks(). Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06  badblocks: return boolean from badblocks_set() and badblocks_clear()  (Zheng Qixing)
Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05  md: fix mddev uaf while iterating all_mddevs list  (Yu Kuai)
While iterating the all_mddevs list from md_notify_reboot() and md_exit(), list_for_each_entry_safe is used, and this can race with deleting the next mddev, causing a UAF:
t1:
 spin_lock
 // list_for_each_entry_safe(mddev, n, ...)
 mddev_get(mddev1)
 // assume mddev2 is the next entry
 spin_unlock
            t2:
            // remove mddev2
            ...
            mddev_free
            spin_lock
            list_del
            spin_unlock
            kfree(mddev2)
 mddev_put(mddev1)
 spin_lock
 // continue dereference mddev2->all_mddevs
The old helper for_each_mddev() actually grabbed the reference of mddev2 while holding the lock, to prevent it from being freed. This problem could be fixed the same way, however, the code would be complex. Hence switch to list_for_each_entry; in this case mddev_put() can free mddev1, which is not safe either. Referring to md_seq_show(), also factor out a helper mddev_put_locked() to fix this problem.
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com
Fixes: f26514342255 ("md: stop using for_each_mddev in md_notify_reboot")
Fixes: 16648bac862f ("md: stop using for_each_mddev in md_exit")
Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
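A simplified sketch of the safe iteration pattern described above, as it would live inside md.c (all_mddevs, all_mddevs_lock and mddev_get() are md.c internals): the reference is dropped with a "locked" put while all_mddevs_lock is held, so the list entry we are standing on cannot be unlinked and freed mid-walk. The wrapper name and callback are illustrative, not the merged code:
```
#include "md.h"

static void example_for_each_active_mddev(void (*fn)(struct mddev *mddev))
{
	struct mddev *mddev;

	spin_lock(&all_mddevs_lock);
	list_for_each_entry(mddev, &all_mddevs, all_mddevs) {
		if (!mddev_get(mddev))		/* skip entries being deleted */
			continue;
		spin_unlock(&all_mddevs_lock);

		fn(mddev);

		spin_lock(&all_mddevs_lock);
		mddev_put_locked(mddev);	/* drop the ref without dropping the lock */
	}
	spin_unlock(&all_mddevs_lock);
}
```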