summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_zone_alloc.c
AgeCommit message (Collapse)Author
5 daysMerge tag 'xfs-merge-6.19' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "There are no major changes in xfs. This contains mostly some code cleanups, a few bug fixes and documentation update. Highlights are: - Quota locking cleanup - Getting rid of old xlog_in_core_2_t type" * tag 'xfs-merge-6.19' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (33 commits) docs: remove obsolete links in the xfs online repair documentation xfs: move some code out of xfs_iget_recycle xfs: use zi more in xfs_zone_gc_mount xfs: remove the unused bv field in struct xfs_gc_bio xfs: remove xarray mark for reclaimable zones xfs: remove the xlog_in_core_t typedef xfs: remove l_iclog_heads xfs: remove the xlog_rec_header_t typedef xfs: remove xlog_in_core_2_t xfs: remove a very outdated comment from xlog_alloc_log xfs: cleanup xlog_alloc_log a bit xfs: don't use xlog_in_core_2_t in struct xlog_in_core xfs: add a on-disk log header cycle array accessor xfs: add a XLOG_CYCLE_DATA_SIZE constant xfs: reduce ilock roundtrips in xfs_qm_vop_dqalloc xfs: move xfs_dquot_tree calls into xfs_qm_dqget_cache_{lookup,insert} xfs: move quota locking into xrep_quota_item xfs: move quota locking into xqcheck_commit_dquot xfs: move q_qlock locking into xqcheck_compare_dquot xfs: move q_qlock locking into xchk_quota_item ...
5 daysMerge tag 'for-6.19/block-20251201' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...
7 daysMerge tag 'vfs-6.19-rc1.writeback' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull writeback updates from Christian Brauner: "Features: - Allow file systems to increase the minimum writeback chunk size. The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. This adds a superblock field that allows the file system to override the default size, and sets it to the zone size for zoned XFS. - Add logging for slow writeback when it exceeds sysctl_hung_task_timeout_secs. This helps identify tasks waiting for a long time and pinpoint potential issues. Recording the starting jiffies is also useful when debugging a crashed vmcore. - Wake up waiting tasks when finishing the writeback of a chunk Cleanups: - filemap_* writeback interface cleanups. Adding filemap_fdatawrite_wbc ended up being a mistake, as all but the original btrfs caller should be using better high level interfaces instead. This series removes all these low-level interfaces, switches btrfs to a more specific interface, and cleans up other too low-level interfaces. With this the writeback_control that is passed to the writeback code is only initialized in three places. - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and filemap_fdatawrite_wbc - Add filemap_flush_nr helper for btrfs - Push struct writeback_control into start_delalloc_inodes in btrfs - Rename filemap_fdatawrite_range_kick to filemap_flush_range - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm - Make wbc_to_tag() inline and use it in fs" * tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Make wbc_to_tag() inline and use it in fs. xfs: set s_min_writeback_pages for zoned file systems writeback: allow the file system to override MIN_WRITEBACK_PAGES writeback: cleanup writeback_chunk_size mm: rename filemap_fdatawrite_range_kick to filemap_flush_range mm: remove __filemap_fdatawrite_range mm: remove filemap_fdatawrite_wbc mm: remove __filemap_fdatawrite mm,btrfs: add a filemap_flush_nr helper btrfs: push struct writeback_control into start_delalloc_inodes btrfs: use the local tmp_inode variable in start_delalloc_inodes ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers 9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs) writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-11-12xfs: remove xarray mark for reclaimable zonesHans Holmberg
We can easily check if there are any reclaimble zones by just looking at the used counters in the reclaim buckets, so do that to free up the xarray mark we currently use for this purpose. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix zone selection in xfs_select_open_zone_mruChristoph Hellwig
xfs_select_open_zone_mru needs to pass XFS_ZONE_ALLOC_OK to xfs_try_use_zone because we only want to tightly pack into zones of the same or a compatible temperature instead of any available zone. This got broken in commit 0301dae732a5 ("xfs: refactor hint based zone allocation"), which failed to update this particular caller when switching to an enum. xfs/638 sometimes, but not reliably fails due to this change. Fixes: 0301dae732a5 ("xfs: refactor hint based zone allocation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix a rtgroup leak when xfs_init_zone failsChristoph Hellwig
Drop the rtgrop reference when xfs_init_zone fails for a conventional device. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: use blkdev_report_zones_cached()Damien Le Moal
Modify xfs_mount_zones() to replace the call to blkdev_report_zones() with blkdev_report_zones_cached() to speed-up mount operations. Since this causes xfs_zone_validate_seq() to see zones with the BLK_ZONE_COND_ACTIVE condition, this function is also modified to acept this condition as valid. With this change, mounting a freshly formatted large capacity (30 TB) SMR HDD completes under 2s compared to over 4.7s before. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-31xfs: document another racy GC case in xfs_zoned_map_extentChristoph Hellwig
Besides blocks being invalidated, there is another case when the original mapping could have changed between querying the rmap for GC and calling xfs_zoned_map_extent. Document it there as it took us quite some time to figure out what is going on while developing the multiple-GC protection fix. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-10-29xfs: set s_min_writeback_pages for zoned file systemsChristoph Hellwig
Set s_min_writeback_pages to the zone size, so that writeback always writes up to a full zone. This ensures that writeback does not add spurious file fragmentation when writing back a large number of files that are larger than the zone size. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251017034611.651385-4-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-21xfs: cache open zone in inode->i_privateChristoph Hellwig
The MRU cache for open zones is unfortunately still not ideal, as it can time out pretty easily when doing heavy I/O to hard disks using up most or all open zones. One option would be to just increase the timeout, but while looking into that I realized we're just better off caching it indefinitely as there is no real downside to that once we don't hold a reference to the cache open zone. So switch the open zone to RCU freeing, and then stash the last used open zone into inode->i_private. This helps to significantly reduce fragmentation by keeping I/O localized to zones for workloads that write using many open files to HDD. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-10-21xfs: do not tightly pack-write large filesDamien Le Moal
When using a zoned realtime device, tightly packing of data blocks belonging to multiple closed files into the same realtime group (RTG) is very efficient at improving write performance. This is especially true with SMR HDDs as this can reduce, and even suppress, disk head seeks. However, such tight packing does not make sense for large files that require at least a full RTG. If tight packing placement is applied for such files, the VM writeback thread switching between inodes result in the large files to be fragmented, thus increasing the garbage collection penalty later when the RTG needs to be reclaimed. This problem can be avoided with a simple heuristic: if the size of the inode being written back is at least equal to the RTG size, do not use tight-packing. Modify xfs_zoned_pack_tight() to always return false in this case. With this change, a multi-writer workload writing files of 256 MB on a file system backed by an SMR HDD with 256 MB zone size as a realtime device sees all files occupying exactly one RTG (i.e. one device zone), thus completely removing the heavy fragmentation observed without this change. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: improve default maximum number of open zonesDamien Le Moal
For regular block devices using the zoned allocator, the default maximum number of open zones is set to 1/4 of the number of realtime groups. For a large capacity device, this leads to a very large limit. E.g. with a 26 TB HDD: mount /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks size (23959 max open) In turn such large limit on the number of open zones can lead, depending on the workload, on a very large number of concurrent write streams which devices generally do not handle well, leading to poor performance. Introduce the default limit XFS_DEFAULT_MAX_OPEN_ZONES, defined as 128 to match the hardware limit of most SMR HDDs available today, and use this limit to set mp->m_max_open_zones in xfs_calc_open_zones() instead of calling xfs_max_open_zones(), when the user did not specify a limit with the max_open_zones mount option. For the 26 TB HDD example, we now get: mount /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks (128 max open zones) This change does not prevent the user from specifying a lareger number for the open zones limit. E.g. mount -o max_open_zones=4096 /dev/sdb /mnt ... XFS (sdb): 95836 zones of 65536 blocks (4096 max open zones) Finally, since xfs_calc_open_zones() checks and caps the mp->m_max_open_zones limit against the value calculated by xfs_max_open_zones() for any type of device, this new default limit does not increase m_max_open_zones for small capacity devices. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-18xfs: improve zone statistics messageDamien Le Moal
Reword the information message displayed in xfs_mount_zones() indicating the total zone count and maximum number of open zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: adjust the hint based zone allocation policyHans Holmberg
As we really can't make any general assumptions about files that don't have any life time hint set or are set to "NONE", adjust the allocation policy to avoid co-locating data from those files with files with a set life time. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-09-16xfs: refactor hint based zone allocationHans Holmberg
Replace the co-location code with a matrix that makes it more clear on how the decisions are made. The matrix contains scores for zone/file hint combinations. A "GOOD" score for an open zone will result in immediate co-location while "OK" combinations will only be picked if we cannot open a new zone. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-08-19xfs: remove xfs_last_used_zoneChristoph Hellwig
This was my first attempt at caching the last used zone. But it turns out for O_DIRECT or RWF_DONTCACHE that operate concurrently or in very short sequence, the bmap btree does not record a written extent yet, so it fails. Because it then still finds the last written zone it can lead to a weird ping-pong around a few zones with writers seeing different values. Remove it entirely as the later added xfs_cached_zone actually does a much better job enforcing the locality as the zone is associated with the inode in the MRU cache as soon as the zone is selected. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-08-11xfs: split xfs_zone_record_blocksChristoph Hellwig
xfs_zone_record_blocks not only records successfully written blocks that now back file data, but is also used for blocks speculatively written by garbage collection that were never linked to an inode and instantly become invalid. Split the latter functionality out to be easier to understand. This also make it clear that we don't need to attach the rmap inode to a transaction for the skipped blocks case as we never dirty any peristent data structure. Also make the argument order to xfs_zone_record_blocks a bit more natural. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: improve the comments in xfs_select_zone_nowaitChristoph Hellwig
The top of the function comment is outdated, and the parts still correct duplicate information in comment inside the function. Remove the top of the function comment and instead improve a comment inside the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: improve the comments in xfs_max_open_zonesChristoph Hellwig
Describe the rationale for the decisions a bit better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: rename oz_write_pointer to oz_allocatedChristoph Hellwig
This member just tracks how much space we handed out for sequential write required zones. Only for conventional space it actually is the pointer where thing are written at, otherwise zone append manages that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-24xfs: use a uint32_t to cache i_used_blocks in xfs_init_zoneChristoph Hellwig
i_used_blocks is a uint32_t, so use the same value for the local variable caching it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: move xfs_submit_zoned_bio a bitChristoph Hellwig
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement") add the new code right between xfs_submit_zoned_bio and xfs_zone_alloc_and_submit which implement the main zoned write path. Move xfs_submit_zoned_bio down to keep it together again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: check for shutdown before going to sleep in xfs_select_zoneChristoph Hellwig
Ensure the file system hasn't been shut down before waiting for a free zone to become available, because that won't happen on a shut down file system. Without this processes can occasionally get stuck in the allocator wait loop when racing with a file system shutdown. This sporadically happens when running generic/388 or generic/475. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: add inode to zone caching for data placementHans Holmberg
Placing data from the same file in the same zone is a great heuristic for reducing write amplification and we do this already - but only for sequential writes. To support placing data in the same way for random writes, reuse the xfs mru cache to map inodes to open zones on first write. If a mapping is present, use the open zone for data placement for this file until the zone is full. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-04-14xfs: add tunable threshold parameter for triggering zone GCHans Holmberg
Presently we start garbage collection late - when we start running out of free zones to backfill max_open_zones. This is a reasonable default as it minimizes write amplification. The longer we wait, the more blocks are invalidated and reclaim cost less in terms of blocks to relocate. Starting this late however introduces a risk of GC being outcompeted by user writes. If GC can't keep up, user writes will be forced to wait for free zones with high tail latencies as a result. This is not a problem under normal circumstances, but if fragmentation is bad and user write pressure is high (multiple full-throttle writers) we will "bottom out" of free zones. To mitigate this, introduce a zonegc_low_space tunable that lets the user specify a percentage of how much of the unused space that GC should keep available for writing. A high value will reclaim more of the space occupied by unused blocks, creating a larger buffer against write bursts. This comes at a cost as write amplification is increased. To illustrate this using a sample workload, setting zonegc_low_space to 60% avoids high (500ms) max latencies while increasing write amplification by 15%. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-18xfs: don't wake zone space waiters without m_zone_infoDarrick J. Wong
xfs_zoned_wake_all checks SB_ACTIVE to make sure it does the right thing when a shutdown happens during unmount, but it fails to account for the log recovery special case that sets SB_ACTIVE temporarily. Add a NULL check to cover both cases. Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: added a commit log and comment] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-03xfs: support write life time based data placementHans Holmberg
Add a file write life time data placement allocation scheme that aims to minimize fragmentation and thereby to do two things: a) separate file data to different zones when possible. b) colocate file data of similar life times when feasible. To get best results, average file sizes should align with the zone capacity that is reported through the XFS_IOC_FSGEOMETRY ioctl. This improvement in data placement efficiency reduces the number of blocks requiring relocation by GC, and thus decreases overall write amplification. The impact on performance varies depending on how full the file system is. For RocksDB using leveled compaction, the lifetime hints can improve throughput for overwrite workloads at 80% file system utilization by ~10%, but for lower file system utilization there won't be as much benefit in application performance as there is less need for garbage collection to start with. Lifetime hints can be disabled using the nolifetime mount option. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: implement zoned garbage collectionChristoph Hellwig
RT groups on a zoned file system need to be completely empty before their space can be reused. This means that partially empty groups need to be emptied entirely to free up space if no entirely free groups are available. Add a garbage collection thread that moves all data out of the least used zone when not enough free zones are available, and which resets all zones that have been emptied. To find empty zone a simple set of 10 buckets based on the amount of space used in the zone is used. To empty zones, the rmap is walked to find the owners and the data is read and then written to the new place. To automatically defragment files the rmap records are sorted by inode and logical offset. This means defragmentation of parallel writes into a single zone happens automatically when performing garbage collection. Because holding the iolock over the entire GC cycle would inject very noticeable latency for other accesses to the inodes, the iolock is not taken while performing I/O. Instead the I/O completion handler checks that the mapping hasn't changed over the one recorded at the start of the GC cycle and doesn't update the mapping if it change. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add support for zoned space reservationsChristoph Hellwig
For zoned file systems garbage collection (GC) has to take the iolock and mmaplock after moving data to a new place to synchronize with readers. This means waiting for garbage collection with the iolock can deadlock. To avoid this, the worst case required blocks have to be reserved before taking the iolock, which is done using a new RTAVAILABLE counter that tracks blocks that are free to write into and don't require garbage collection. The new helpers try to take these available blocks, and if there aren't enough available it wakes and waits for GC. This is done using a list of on-stack reservations to ensure fairness. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: add the zoned space allocatorChristoph Hellwig
For zoned RT devices space is always allocated at the write pointer, that is right after the last written block and only recorded on I/O completion. Because the actual allocation algorithm is very simple and just involves picking a good zone - preferably the one used for the last write to the inode. As the number of zones that can written at the same time is usually limited by the hardware, selecting a zone is done as late as possible from the iomap dio and buffered writeback bio submissions helpers just before submitting the bio. Given that the writers already took a reservation before acquiring the iolock, space will always be readily available if an open zone slot is available. A new structure is used to track these open zones, and pointed to by the xfs_rtgroup. Because zoned file systems don't have a rsum cache the space for that pointer can be reused. Allocations are only recorded at I/O completion time. The scheme used for that is very similar to the reflink COW end I/O path. Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>