path: root/fs/btrfs/zstd.c
Age  Commit message  Author
2025-09-23  btrfs: add unlikely annotations to branches leading to EIO  (David Sterba)
The unlikely() annotation is a static prediction hint that the compiler may use to reorder code out of the hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, and EIO is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
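As an illustration of the pattern (a hedged sketch, not the exact btrfs hunk; the helper name is hypothetical):

    ret = decompress_one_block(workspace, in, out);   /* hypothetical helper */
    if (unlikely(ret < 0)) {
            /* Error branch almost never taken; the hint keeps the success path hot. */
            btrfs_err(fs_info, "decompression failed, error %d", ret);
            return -EIO;
    }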
2025-09-23  btrfs: prepare zstd to support bs > ps cases  (Qu Wenruo)
This involves converting the following functions to use proper folio sizes/shifts:

- zstd_compress_folios()
- zstd_decompress_bio()

The function zstd_decompress() already uses the block size correctly without relying on page size, so it needs no modification.

Since zstd compression calls kmap_local_folio(), the existing code cannot handle large folios with HIGHMEM, as kmap_local_folio() requires us to handle one page range at a time. It is not really worth spending time on a feature that will eventually be deprecated, so just add an extra explicit rejection of bs > ps on kernels with the HIGHMEM feature enabled.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: prepare compression folio alloc/free for bs > ps cases  (Qu Wenruo)
This includes the following preparation for bs > ps cases:

- Always alloc/free the folio directly if bs > ps
  This adds a new @fs_info parameter to btrfs_alloc_compr_folio(), thus affecting all compression algorithms. btrfs_free_compr_folio() needs no new parameter for now, as we can use the folio size to skip the caching part. For now the change only passes a @fs_info into the function; all folio size assumptions are still based on page size.

- Properly zero the last folio in compress_file_range()
  Since the compressed folios can be larger than a page, we need to properly zero the whole folio.

- Use the correct folio size for btrfs_add_compressed_bio_folios()
  Instead of page size, use the correct folio size.

- Use the correct folio size/shift for btrfs_compress_filemap_get_folio()
  As we are no longer using only simple page sized folios.

- Use the correct folio size for btrfs_decompress()
  There is an ASSERT() making sure the decompressed range is no larger than a page, which would be triggered for bs > ps cases.

- Skip readahead for compressed pages
  Similar to subpage cases.

- Make btrfs_alloc_folio_array() accept a new @order parameter

- Add a helper to calculate the minimal folio size

None of these changes should affect the existing bs <= ps handling.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: reduce compression workspace buffer space to block size  (Qu Wenruo)
Currently the compression workspace buffer size is always based on PAGE_SIZE, but btrfs has supported subpage sized block sizes for some time. For a one-shot compression algorithm like lzo, this means we waste quite some memory if the block size is smaller than the page size, as lzo only works on one block at a time (thus one-shot). On 64K page sized systems with a 4K block size we need at most 8K of buffer space for lzo, but in reality we allocate a 64K buffer.

To reduce the memory usage, base the size of all workspace buffers on the block size rather than the page size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
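A minimal sketch of the sizing change under these assumptions (struct and helper names are illustrative, not the exact btrfs code):

    struct workspace_example {
            void *buf;        /* one-shot compression buffer */
            size_t buf_size;
    };

    static int init_workspace_buf(struct workspace_example *ws,
                                  struct btrfs_fs_info *fs_info)
    {
            /* Was PAGE_SIZE; now follow the filesystem block (sector) size. */
            ws->buf_size = fs_info->sectorsize;
            ws->buf = kvmalloc(ws->buf_size, GFP_KERNEL);
            return ws->buf ? 0 : -ENOMEM;
    }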
2025-09-23  btrfs: rename btrfs_compress_op to btrfs_compress_levels  (Qu Wenruo)
Since all workspace managers are per-fs, there is no need nor any way to store them inside btrfs_compress_op::wsm anymore. With that said, we can do the following modifications:

- Remove zstd_workspace_manager::ops
  Zstd always grabs the global btrfs_compress_op[].

- Remove the btrfs_compress_op::wsm member

- Rename btrfs_compress_op to btrfs_compress_levels
  This should make it clearer that the btrfs_compress_levels structures only indicate the levels of each compression algorithm.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: cleanup the per-module compression workspace managers  (Qu Wenruo)
Since all workspaces are handled by the per-fs workspace managers, we can safely remove the old per-module managers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: migrate to use per-fs workspace manager  (Qu Wenruo)
There are several interfaces involved for each algorithm:

- alloc workspace
  All algorithms allocate a workspace without needing the workspace manager, so no change is needed.

- get workspace
  This involves checking the workspace manager for a free workspace and, if there is none, allocating a new one. None and lzo share the generic btrfs_get_workspace() helper, which only needs to be updated to use the per-fs manager. Zlib uses a wrapper around btrfs_get_workspace(), so no special work is needed. For zstd, update zstd_find_workspace() and zstd_get_workspace() to utilize the per-fs manager.

- put workspace
  None/zlib/lzo share the same btrfs_put_workspace(); update that function to use the per-fs manager. For zstd it is zstd_put_workspace(), with the same update.

- zstd specific timer
  This is the timer to reclaim workspaces; change it to grab the per-fs workspace manager instead.

Now all workspaces are managed by the per-fs manager.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: add workspace manager initialization for zstd  (Qu Wenruo)
This involves:

- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
  These two functions accept an fs_info pointer and alloc/free the fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.

- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
  These are helpers allocating the workspace managers for all algorithms. For now only zstd is supported, and the timing is a little unusual: btrfs_alloc_compress_wsm() should only be called after the sectorsize has been initialized, while btrfs_free_fs_info_compress() is called in btrfs_free_fs_info().

- Move the definition of btrfs_compression_type to "fs.h"
  The reason is that "compression.h" already includes "fs.h", thus we cannot simply include "compression.h" to get the definition of BTRFS_NR_COMPRESS_TYPES needed to define fs_info::compr_wsm[].

For now the per-fs zstd workspace manager has no real effect, and all compression still goes through the global workspace manager.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23  btrfs: add an fs_info parameter for compression workspace manager  (Qu Wenruo)
[BACKGROUND]
Currently btrfs shares workspaces and their managers for all filesystems. This is mostly fine, as all those workspaces use page size based buffers and btrfs only supports block size (bs) <= page size (ps). Even if bs < ps we at most waste some buffer space in the workspace, but everything still works fine.

The problem is that this limits our support for bs > ps cases. A workspace may then need a larger buffer to handle bs > ps, but since the pool has no way to distinguish different workspaces, a regular workspace (still using a buffer sized for ps) can be handed to a btrfs whose bs > ps. In that case the buffer is not large enough and will cause various problems.

[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter to all workspace related functions. For now this new fs_info parameter is not yet utilized.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22  btrfs: use blocksize to check if compression is making things larger  (Qu Wenruo)
[BEHAVIOR DIFFERENCE BETWEEN COMPRESSION ALGOS]
Currently the LZO compression algorithm checks whether we are making the compressed data larger after compressing more than 2 blocks, but zlib and zstd do the same check after compressing more than 8192 bytes.

This is not a big deal, but since we already support larger block sizes (e.g. 64K block size if the page size is also 64K), this check is not suitable for all block sizes. For example, if our page and block size are both 16KiB and, after the first block is compressed using zlib, the resulting compressed data is slightly larger than 16KiB, we will immediately abort the compression. This makes the zstd and zlib compression algorithms behave slightly differently from LZO, which only aborts after compressing two blocks.

[ENHANCEMENT]
To unify the behavior, only abort the compression after compressing at least two blocks.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
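A sketch of the unified rule, assuming an illustrative helper (not the exact btrfs code):

    /* Only treat compression as "making things larger" after at least two blocks. */
    static bool compression_makes_larger(u32 blocksize, u64 total_in, u64 total_out)
    {
            if (total_in <= 2 * blocksize)
                    return false;   /* too early to judge, keep compressing */
            return total_out > total_in;
    }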
2025-09-22  btrfs: pass btrfs_inode pointer directly into btrfs_compress_folios()  (Qu Wenruo)
For the 3 supported compression algorithms, two of them (zstd and zlib) already grab the btrfs inode for error messages. It's more common to pass the btrfs_inode and grab the address space from it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21  btrfs: use our message helpers instead of pr_err/pr_warn/pr_info  (David Sterba)
Our message helpers accept NULL for the fs_info in contexts that do not provide one, and print the common header of the message. The pr_* helpers are used only for special reasons, like module loading, device scanning or multi-line output (print-tree).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use unsigned types for constants defined as bit shifts  (David Sterba)
The unsigned type is a recommended practice (CWE-190, CWE-194) for bit shifts to avoid problems with potential unwanted sign extensions. Although there are no such cases in the btrfs codebase, follow the recommendation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
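For illustration, the kind of definition this refers to (the constant names are hypothetical):

    /* Before: signed literal; if the sign bit ends up set, widening can sign-extend. */
    #define EXAMPLE_FLAG            (1 << 31)
    /* After: unsigned literal, no sign extension (CWE-190, CWE-194). */
    #define EXAMPLE_FLAG_UNSIGNED   (1U << 31)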
2025-05-15  btrfs: prepare compression paths for large data folios  (Qu Wenruo)
None of the compression algorithms inside btrfs support large folios, due to the following points:

- btrfs_calc_input_length() assumes a page sized folio
- kmap_local_folio() usages rely on offset_in_page()

Prepare them to support large data folios by:

- Adding a folio parameter to btrfs_calc_input_length()
  And using that folio parameter to calculate the correct length. Since we're here, also add extra ASSERT()s to make sure the parameter @cur is inside the folio range. This affects only zlib and zstd; lzo compresses at most one block at a time and is thus not affected.

- Using offset_in_folio() to calculate the kmap_local_folio() offset
  This affects all 3 algorithms.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-05  treewide: Switch/rename to timer_delete[_sync]()  (Thomas Gleixner)
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
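For zstd.c this is a mechanical rename; a sketch of the kind of hunk involved (the timer field name is illustrative):

    /* Before: historical wrappers */
    del_timer(&wsm.timer);
    del_timer_sync(&wsm.timer);
    /* After: current API, identical semantics */
    timer_delete(&wsm.timer);
    timer_delete_sync(&wsm.timer);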
2025-03-18  btrfs: zstd: remove local variable for storing page offsets  (David Sterba)
When using offset_in_page() it's clear what it means; we don't need to store it in a local variable just to use it right away. There's no change in the generated code, but it keeps the declarations smaller.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: zstd: move zstd_parameters to the workspace  (David Sterba)
Reduce stack consumption of zstd_compress_folios() by 40 bytes (10 * sizeof(int)), as we can store struct zstd_parameters in the workspace that is reused for each call.

    typedef struct {
        ZSTD_compressionParameters cParams;
        ZSTD_frameParameters fParams;
    } ZSTD_parameters;

    typedef struct {
        unsigned windowLog;
        unsigned chainLog;
        unsigned hashLog;
        unsigned searchLog;
        unsigned minMatch;
        unsigned targetLength;
        ZSTD_strategy strategy;
    } ZSTD_compressionParameters;

    typedef struct {
        int contentSizeFlag;
        int checksumFlag;
        int noDictIDFlag;
    } ZSTD_frameParameters;

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: zstd: enable negative compression levels mount option  (Daniel Vacek)
Allow using the fast modes (negative compression levels) of zstd as a mount option. As per the results, the compression ratio is (expectedly) lower:

    for level in {-15..-1} 1 2 3; \
    do printf "level %3d\n" $level; \
       mount -o compress=zstd:$level /dev/sdb /mnt/test/; \
       grep sdb /proc/mounts; \
       cp -r /usr/bin /mnt/test/; sync; compsize /mnt/test/bin; \
       cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \
       cp enwik9 /mnt/test/; sync; compsize /mnt/test/enwik9; \
       cp linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \
       rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \
       umount /mnt/test/; \
    done |& tee results | \
    awk '/^level/{print}/^TOTAL/{print$3"\t"$2" |"}' | paste - - - - -

                   266M bin |   45M doc | 953M wiki | 1.4G source
    ==========================+============+============+=============
    level -15    180M 67%    |  30M 68%  | 694M 72%  | 598M 40%  |
    level -14    180M 67%    |  30M 67%  | 683M 71%  | 581M 39%  |
    level -13    177M 66%    |  29M 66%  | 671M 70%  | 566M 38%  |
    level -12    174M 65%    |  29M 65%  | 658M 69%  | 548M 37%  |
    level -11    174M 65%    |  28M 64%  | 645M 67%  | 530M 35%  |
    level -10    171M 64%    |  28M 62%  | 631M 66%  | 512M 34%  |
    level  -9    165M 62%    |  27M 61%  | 615M 64%  | 493M 33%  |
    level  -8    161M 60%    |  27M 59%  | 598M 62%  | 475M 32%  |
    level  -7    155M 58%    |  26M 58%  | 582M 61%  | 457M 30%  |
    level  -6    151M 56%    |  25M 56%  | 565M 59%  | 437M 29%  |
    level  -5    145M 54%    |  24M 55%  | 545M 57%  | 417M 28%  |
    level  -4    139M 52%    |  23M 52%  | 520M 54%  | 391M 26%  |
    level  -3    135M 50%    |  22M 50%  | 495M 51%  | 369M 24%  |
    level  -2    127M 47%    |  22M 48%  | 470M 49%  | 349M 23%  |
    level  -1    120M 45%    |  21M 47%  | 452M 47%  | 332M 22%  |
    level   1    110M 41%    |  17M 39%  | 362M 38%  | 290M 19%  |
    level   2    106M 40%    |  17M 38%  | 349M 36%  | 288M 19%  |
    level   3    104M 39%    |  16M 37%  | 340M 35%  | 276M 18%  |

The samples represent some data sets that can be commonly found and show approximate compressibility. The fast levels trade off speed for ratio and are best suitable for highly compressible data. As can be seen above, comparing the results to the current default zstd level 3, the negative levels are roughly 2x worse at -15 and the ratio increases almost linearly with each level.

Signed-off-by: Daniel Vacek <neelx@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11  btrfs: zstd: assert the timer pointer in callback  (David Sterba)
Make sure we got the right timer struct for the zstd workspace reclaim work.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11  btrfs: zstd: make the compression path to handle sector size < page size  (Qu Wenruo)
Inside zstd_compress_folios(), after exhausting one input page we need to switch to the next page as input. However, when counting the total input bytes (@tot_in), we always increase it by PAGE_SIZE.

For the following case this produces an incorrect value:

    0          32K         64K         96K
    |          |///////////||///////////|

After compressing range [32K, 64K) we switch to the next page and increase @tot_in by 64K, while we have only read 32K. This causes @total_in to return a value larger than the input length.

Fix it by only increasing @tot_in by the input size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10  btrfs: convert zstd_decompress() to take a folio  (Li Zetao)
The old page API is being gradually replaced and converted to use folios to improve code readability and avoid repeated conversion between page and folio. memcpy_to_page() can be replaced with memcpy_to_folio(). There is no memzero_folio(), but it can be replaced equivalently by folio_zero_range().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
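The replacements described above look roughly like this (offsets, lengths and the workspace buffer name are placeholders, not the exact zstd.c variables):

    /* Before: page based helpers */
    memcpy_to_page(dest_page, dest_pgoff, workspace->dest_buf, to_copy);
    memzero_page(dest_page, dest_pgoff + to_copy, destlen - to_copy);

    /* After: folio based helpers; folio_zero_range() stands in for the missing memzero_folio() */
    memcpy_to_folio(dest_folio, dest_pgoff, workspace->dest_buf, to_copy);
    folio_zero_range(dest_folio, dest_pgoff + to_copy, destlen - to_copy);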
2024-09-10  btrfs: make compression path to be subpage compatible  (Qu Wenruo)
Currently the btrfs compression path is not really subpage compatible; everything is still done in page units.

That's fine for the regular sector size and for the current subpage routine, as even for subpage, compression is only enabled if the whole range is page aligned, so reading the page cache in page units is totally fine.

However, in preparation for future subpage perfect compression support, we need to change the compression routine to properly handle a subpage range.

This patch prepares both zlib and zstd to read only the subpage range for compression. Lzo already does subpage aware reads, as lzo's on-disk format is already sectorsize dependent.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11  btrfs: enhance compression error messages  (David Sterba)
Add more verbose and specific messages to all main error points in the compression code, for all algorithms. Currently there's no way to know which inode is affected or where in the data the errors happened.

The messages follow a common format:

- what happened
- error code if relevant
- root and inode
- additional data like offsets or lengths

There's no helper for the messages as they differ in some details and that would be cumbersome to generalize into a single function. As all the errors are of the "almost never happens" kind, the branches carry unlikely annotations, since compression is a hot path.

Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07  btrfs: compression: migrate compression/decompression paths to folios  (Qu Wenruo)
For both compression and decompression paths we always require a "struct page **pages" and "unsigned long nr_pages". This involves quite a few parts of the btrfs compression paths:

- All the compression entry points

- The compressed_bio structure
  This affects both compression and decompression.

- The async_extent structure

Unfortunately, with all those parts involved there is no good way to split the conversion into smaller patches that still compile, so do it as one big conversion in one go.

Please note this is a direct page-to-folio conversion; there is no change to the page sized folio requirement yet.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07  btrfs: compression: convert page allocation to folio interfaces  (Qu Wenruo)
Currently we have two wrappers to allocate and free a page for compression usage:

- btrfs_alloc_compr_page()
- btrfs_free_compr_page()

The allocator tries to grab a page from the pool and only allocates a new page if the pool is empty. The reclaimer checks whether the pool is full and, if not, puts the page into the pool.

This patch converts both helpers to use folio interfaces, allowing further conversion of the compression path to folios.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07  btrfs: compression: add error handling for missed page cache  (Qu Wenruo)
For all the supported compression algorithms, the compression path always needs to grab the page cache before doing the compression.

Normally we get a page reference without any problem, since the write path should have already locked the pages in the write range. But for the sake of error handling we should handle the page cache miss case.

Add a common wrapper, btrfs_compress_find_get_page(), which calls find_get_page() and does the error handling along with an error message. Callers inside the compression path only need to call btrfs_compress_find_get_page() and error out if it returns an error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
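A minimal sketch of such a wrapper, assuming a simplified signature (the real helper's prototype and message text may differ):

    static struct page *compress_find_get_page(struct btrfs_fs_info *fs_info,
                                               struct address_space *mapping,
                                               u64 start)
    {
            struct page *page = find_get_page(mapping, start >> PAGE_SHIFT);

            if (!page) {
                    /* Should not happen: the write path has locked this range. */
                    btrfs_err(fs_info, "missing page cache for offset %llu", start);
                    return ERR_PTR(-ENOENT);
            }
            return page;
    }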
2024-03-04  btrfs: zstd: fix and simplify the inline extent decompression (v2)  (Qu Wenruo)
Note: this is a fixed version of the change that was previously reverted as e01a83e12604 ("Revert "btrfs: zstd: fix and simplify the inline extent decompression""), with fixed parameters to memzero_page().

[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this:

    item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
        generation 8 transid 8 size 4096 nbytes 4096
        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
        sequence 1 flags 0x0(none)
    item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
        index 2 namelen 14 name: source_inlined
    item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
        generation 8 type 0 (inline)
        inline extent data size 48 ram_bytes 4096 compression 3 (zstd)

then trying to reflink that extent on an aarch64 system with 64K page size, the reflink just fails:

    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
    XFS_IOC_CLONE_RANGE: Input/output error

[CAUSE]
In zstd_decompress() we didn't treat @start_byte as just a page offset, but also used it as an indicator of whether we should error out, without any proper explanation (this was copied from other decompression code).

In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffers nor error out, since the whole input/output buffer should never exceed one sector; thus no buffer switch is needed.

Thus the current code using @start_byte as a condition to switch input/output buffers or finish the decompression is completely incorrect.

[FIX]
The fix involves several modifications:

- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize instead of PAGE_SIZE to properly initialize the output buffer size
- Use the correct destination offset inside the destination page
- Simplify the main loop
  Since the input/output buffers never switch, we only need one zstd_decompress_stream() call.
- Consider an early end as an error

After the fix, even on 64K page sized aarch64, the above reflink works as expected:

    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
    linked 4096/4096 bytes at offset 61440

and results in the correct file layout:

    item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
        generation 10 transid 10 size 65536 nbytes 4096
        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
        sequence 1 flags 0x0(none)
    item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
        index 3 namelen 4 name: dest
    item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
        location key (0 UNKNOWN.0 0) type XATTR
        transid 10 data_len 37 name_len 16
        name: security.selinux
        data unconfined_u:object_r:unlabeled_t:s0
    item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
        generation 10 type 1 (regular)
        extent data disk byte 13631488 nr 4096
        extent data offset 0 nr 4096 ram 4096
        extent compression 0 (none)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04  btrfs: remove unused included headers  (David Sterba)
With the help of neovim, LSP and clangd we can identify header files that are not actually needed in the .c files. This is focused only on removal (with minor fixups); further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-22  Revert "btrfs: zstd: fix and simplify the inline extent decompression"  (Linus Torvalds)
This reverts commit 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b.

It causes my machine to not even boot, and Klara Modin reports that the cause is that small zstd-compressed files return garbage when read.

Reported-by: Klara Modin <klarasmodin@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/
Reported-and-bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Qu Wenruo <wqu@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-01-18  btrfs: zstd: fix and simplify the inline extent decompression  (Qu Wenruo)
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed extent created like this:

    item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
        generation 8 transid 8 size 4096 nbytes 4096
        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
        sequence 1 flags 0x0(none)
    item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
        index 2 namelen 14 name: source_inlined
    item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
        generation 8 type 0 (inline)
        inline extent data size 48 ram_bytes 4096 compression 3 (zstd)

then trying to reflink that extent on an aarch64 system with 64K page size, the reflink just fails:

    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
    XFS_IOC_CLONE_RANGE: Input/output error

[CAUSE]
In zstd_decompress() we didn't treat @start_byte as just a page offset, but also used it as an indicator of whether we should error out, without any proper explanation (this was copied from other decompression code).

In reality, for subpage cases, although @start_byte can be non-zero, we should never switch input/output buffers nor error out, since the whole input/output buffer should never exceed one sector; thus no buffer switch is needed.

Thus the current code using @start_byte as a condition to switch input/output buffers or finish the decompression is completely incorrect.

[FIX]
The fix involves several modifications:

- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize instead of PAGE_SIZE to properly initialize the output buffer size
- Use the correct destination offset inside the destination page
- Simplify the main loop
  Since the input/output buffers never switch, we only need one zstd_decompress_stream() call.
- Consider an early end as an error

After the fix, even on 64K page sized aarch64, the above reflink works as expected:

    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
    linked 4096/4096 bytes at offset 61440

and results in the correct file layout:

    item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
        generation 10 transid 10 size 65536 nbytes 4096
        block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
        sequence 1 flags 0x0(none)
    item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
        index 3 namelen 4 name: dest
    item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
        location key (0 UNKNOWN.0 0) type XATTR
        transid 10 data_len 37 name_len 16
        name: security.selinux
        data unconfined_u:object_r:unlabeled_t:s0
    item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
        generation 10 type 1 (regular)
        extent data disk byte 13631488 nr 4096
        extent data offset 0 nr 4096 ram 4096
        extent compression 0 (none)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15  btrfs: use page alloc/free wrappers for compression pages  (David Sterba)
This is a preparation for managing compression pages in a cache-like manner, instead of asking the allocator each time. The common allocation and free wrappers are introduced and are functionally equivalent to the current code.

The freeing helpers need to be carefully placed where the last reference is dropped. This is either right after allocating (error handling) or when there are no other users of the pages (after copying the contents).

It's safe to not use the helper and use put_page(), which will handle the reference count. Not using the helper only means fewer pages can be reused without passing them back to the allocator.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12  btrfs: reformat remaining kdoc style comments  (David Sterba)
The function name in the comment does not bring much value to code not exposed as API, and we don't stick to the kdoc format anymore. Update the formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19  btrfs: disable allocation warnings for compression workspaces  (David Sterba)
The workspaces for compression are typically much larger than a page, and for high zstd levels they are in the range of megabytes. There's a fallback to vmalloc, but this can still fail (see the report).

Some of the workspaces are preallocated at module load time so we have a safe fallback; otherwise, when a new workspace is needed it's allocated, and if this fails the process waits. This means the warning only causes noise and we can use the GFP flag to disable it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217466
Signed-off-by: David Sterba <dsterba@suse.com>
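The GFP flag in question is __GFP_NOWARN; a sketch of a large workspace allocation with the failure warning suppressed (the workspace field and size variable are illustrative):

    /* Large zstd workspace: allocation may fail; callers wait for a free
     * workspace instead of spamming the log with allocation warnings. */
    workspace->mem = kvmalloc(workspace_size, GFP_KERNEL | __GFP_NOWARN);
    if (!workspace->mem) {
            kfree(workspace);
            return ERR_PTR(-ENOMEM);
    }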
2023-04-17  btrfs: move zero filling of compressed read bios into common code  (Christoph Hellwig)
All algorithms have to fill the remainder of the orig_bio with zeroes, so do it in common code.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05  btrfs: constify input buffer parameter in compression code  (David Sterba)
The input buffers passed down to compression must never be changed; switch the type to u8, as it's a raw byte buffer, and use const.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
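Schematically, the prototypes change like this (function names are hypothetical and trimmed to the relevant parameter):

    /* Before: mutable char pointer for data that is never written to */
    static int compress_example_old(struct list_head *ws, char *data_in, size_t len);

    /* After: constified raw byte buffer */
    static int compress_example_new(struct list_head *ws, const u8 *data_in, size_t len);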
2022-12-05  btrfs: update function comments  (David Sterba)
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25  btrfs: zstd: replace kmap() with kmap_local_page()  (Fabio M. De Francesco)
The use of kmap() is being deprecated in favor of kmap_local_page(). With kmap_local_page(), the mapping is per thread, CPU local and not globally visible.

Therefore, use kmap_local_page() / kunmap_local() in zstd.c, because in this file the mappings are per thread and are not visible in other contexts.

In the meanwhile, use plain page_address() on output pages allocated with the GFP_NOFS flag instead of calling kmap*() on them (since they are always allocated from ZONE_NORMAL).

Tested with xfstests on a QEMU + KVM 32-bit VM with 4GB of RAM, booting a kernel with HIGHMEM64G enabled.

Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
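The conversion pattern, sketched with placeholder buffer and offset names:

    /* Before: kmap() creates a global, long-lived mapping */
    kaddr = kmap(in_page);
    memcpy(workspace->in_buf, kaddr + offset, len);
    kunmap(in_page);

    /* After: per-thread, CPU-local mapping */
    kaddr = kmap_local_page(in_page);
    memcpy(workspace->in_buf, kaddr + offset, len);
    kunmap_local(kaddr);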
2022-05-16  btrfs: use non-bh spin_lock in zstd timer callback  (Schspa Shi)
This is an optimization on top of fix fee13fe96529 ("btrfs: correct zstd workspace manager lock to use spin_lock_bh()").

The critical region guarded by wsm.lock is only accessed from process context and softirq context. Inside the soft interrupt the critical section cannot be preempted by another soft interrupt, so there is no need to call spin_lock_bh(&wsm.lock) to turn off soft interrupts there; spin_lock(&wsm.lock) is enough for this situation.

Signed-off-by: Schspa Shi <schspa@gmail.com>
[ minor comment update ]
Signed-off-by: David Sterba <dsterba@suse.com>
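A sketch of the distinction (manager, lock and function names are illustrative): the process-context side still disables bottom halves, while the timer callback, already running in softirq context, can take the plain lock:

    /* Process context caller: must block the softirq timer callback. */
    static void get_workspace_locked_example(void)
    {
            spin_lock_bh(&wsm.lock);
            /* ... hand out or park a workspace ... */
            spin_unlock_bh(&wsm.lock);
    }

    /* Timer callback, already in softirq context: it cannot be re-entered by
     * another softirq on this CPU, so the plain spinlock is sufficient. */
    static void reclaim_timer_fn_example(struct timer_list *timer)
    {
            spin_lock(&wsm.lock);
            /* ... free idle workspaces ... */
            spin_unlock(&wsm.lock);
    }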
2021-11-08  lib: zstd: Add kernel-specific API  (Nick Terrell)
This patch:

- Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h`
- Updates modified zstd headers to yearless copyright
- Adds a new API in `include/linux/zstd.h` that is functionally equivalent to the in-use subset of the current API. Functions are renamed to avoid symbol collisions with zstd, to make it clear it is not the upstream zstd API, and to follow the kernel style guide.
- Updates all callers to use the new API.

There are no functional changes in this patch. Since there are no functional changes, I felt it was okay to update all the callers in a single patch. Once the API is approved, the callers are mechanically changed.

This patch is preparing for the 3rd patch in this series, which updates zstd to version 1.4.10. Since the upstream zstd API is no longer exposed to callers, the update can happen transparently.

Signed-off-by: Nick Terrell <terrelln@fb.com>
Tested By: Paul Jones <paul@pauljones.id.au>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
2021-10-29  Revert "btrfs: compression: drop kmap/kunmap from zstd"  (David Sterba)
This reverts commit bbaf9715f3f5b5ff0de71da91fcc34ee9c198ed8.

The kmaps in compression code are still needed and cause crashes on 32bit machines (ARM, x86). Reproducible e.g. by running fstest btrfs/004 with LZO or ZSTD compression enabled.

Example stacktrace with ZSTD on a 32bit ARM machine:

    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    pgd = c4159ed3
    [00000000] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    Modules linked in:
    CPU: 0 PID: 210 Comm: kworker/u2:3 Not tainted 5.14.0-rc79+ #12
    Hardware name: Allwinner sun4i/sun5i Families
    Workqueue: btrfs-delalloc btrfs_work_helper
    PC is at mmiocpy+0x48/0x330
    LR is at ZSTD_compressStream_generic+0x15c/0x28c
    (mmiocpy) from [<c0629648>] (ZSTD_compressStream_generic+0x15c/0x28c)
    (ZSTD_compressStream_generic) from [<c06297dc>] (ZSTD_compressStream+0x64/0xa0)
    (ZSTD_compressStream) from [<c049444c>] (zstd_compress_pages+0x170/0x488)
    (zstd_compress_pages) from [<c0496798>] (btrfs_compress_pages+0x124/0x12c)
    (btrfs_compress_pages) from [<c043c068>] (compress_file_range+0x3c0/0x834)
    (compress_file_range) from [<c043c4ec>] (async_cow_start+0x10/0x28)
    (async_cow_start) from [<c0475c3c>] (btrfs_work_helper+0x100/0x230)
    (btrfs_work_helper) from [<c014ef68>] (process_one_work+0x1b4/0x418)
    (process_one_work) from [<c014f210>] (worker_thread+0x44/0x524)
    (worker_thread) from [<c0156aa4>] (kthread+0x180/0x1b0)
    (kthread) from [<c0100150>]

Link: https://lore.kernel.org/all/CAJCQCtT+OuemovPO7GZk8Y8=qtOObr0XTDp8jh4OHD6y84AFxw@mail.gmail.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214839
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23  btrfs: rework btrfs_decompress_buf2page()  (Qu Wenruo)
There are several bugs inside the function btrfs_decompress_buf2page():

- @start_byte doesn't take bvec.bv_offset into consideration
  Thus it can't handle cases where the target range is not page aligned.

- Too many helper variables
  There are tons of helper variables: @buf_offset, @current_buf_start, @start_byte, @prev_start_byte, @working_bytes, @bytes. This hurts anyone who wants to read the function.

- No obvious main cursor for the iteration
  A new problem caused by the previous one.

- Comments for the parameter list make no sense
  For example, is @buf_start the offset into @buf, or the offset inside the full decompressed extent? (Spoiler alert: the latter.) And @total_out acts more like @buf_start + @size_of_buf. The worst is @disk_start, whose real meaning is the file offset of the full decompressed extent.

This patch reworks the whole function by:

- Adding a proper comment with ASCII art to explain the parameter list

- Reworking the parameter list
  The old @buf_start is renamed to @decompressed, to show how many bytes are already decompressed inside the full decompressed extent. The old @total_out is replaced by @buf_len, which is the decompressed data size. For the old @disk_start and @bio, just pass @compressed_bio in.

- Using a single main cursor
  The main cursor is @cur_file_offset, showing the current file offset. Other helper variables are declared inside the main loop, and only a minimal amount of them:
  * offset_inside_decompressed_buf: the only real helper
  * copy_start_file_offset: file offset where we start the memcpy
  * bvec_file_offset: file offset of the current bvec

Even with all these extensive comments, the final function is still smaller than the original, which is definitely a win.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23  btrfs: compression: drop kmap/kunmap from zstd  (David Sterba)
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23  btrfs: drop from __GFP_HIGHMEM all allocations  (David Sterba)
The highmem flag is used for allocating pages for compression and for raid56 pages.

High memory makes sense on 32bit systems but is not without problems; on 64bit systems it's just another layer of wrappers.

The time the pages are allocated for compression or raid56 is relatively short (about a transaction commit), so the pages are not blocked indefinitely. As the number of pages depends on the amount of data being written/read, there's a theoretical problem: a fast device on a 32bit system could use most of the low memory pool, while with the highmem allocation that would not happen. This was possibly the original idea long ago, but nowadays we optimize for 64bit systems.

This patch removes all usage of the __GFP_HIGHMEM flag for page allocation; the kmap/kunmap calls are still in place and will be removed in followup patches.

Remaining is masking out the bit in alloc_extent_state and __lookup_free_space_inode, which can safely stay.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-05  btrfs: use memzero_page() instead of open coded kmap pattern  (Ira Weiny)
There are many places where kmap/memset/kunmap patterns occur. Use the newly lifted memzero_page() to eliminate direct uses of kmap and leverage the new core functions' use of kmap_local_page().

The development of this patch was aided by the following coccinelle script:

    // <smpl>
    // SPDX-License-Identifier: GPL-2.0-only
    // Find kmap/memset/kunmap pattern and replace with memset*page calls
    //
    // NOTE: Offsets and other expressions may be more complex than what the script
    // will automatically generate. Therefore a catchall rule is provided to find
    // the pattern which then must be evaluated by hand.
    //
    // Confidence: Low
    // Copyright: (C) 2021 Intel Corporation
    // URL: http://coccinelle.lip6.fr/
    // Comments:
    // Options:
    //
    // Then the memset pattern
    //
    @ memset_rule1 @
    expression page, V, L, Off;
    identifier ptr;
    type VP;
    @@
    (
    -VP ptr = kmap(page);
    |
    -ptr = kmap(page);
    |
    -VP ptr = kmap_atomic(page);
    |
    -ptr = kmap_atomic(page);
    )
    <+...
    (
    -memset(ptr, 0, L);
    +memzero_page(page, 0, L);
    |
    -memset(ptr + Off, 0, L);
    +memzero_page(page, Off, L);
    |
    -memset(ptr, V, L);
    +memset_page(page, V, 0, L);
    |
    -memset(ptr + Off, V, L);
    +memset_page(page, V, Off, L);
    )
    ...+>
    (
    -kunmap(page);
    |
    -kunmap_atomic(ptr);
    )

    // Remove any pointers left unused
    @ depends on memset_rule1 @
    identifier memset_rule1.ptr;
    type VP, VP1;
    @@
    -VP ptr;
    ... when != ptr;
    ? VP1 ptr;

    //
    // Catch all
    //
    @ memset_rule2 @
    expression page;
    identifier ptr;
    expression GenTo, GenSize, GenValue;
    type VP;
    @@
    (
    -VP ptr = kmap(page);
    |
    -ptr = kmap(page);
    |
    -VP ptr = kmap_atomic(page);
    |
    -ptr = kmap_atomic(page);
    )
    <+...
    (
    //
    // Some call sites have complex expressions within the memset/memcpy
    // The follow are catch alls which need to be evaluated by hand.
    //
    -memset(GenTo, 0, GenSize);
    +memzero_pageExtra(page, GenTo, GenSize);
    |
    -memset(GenTo, GenValue, GenSize);
    +memset_pageExtra(page, GenValue, GenTo, GenSize);
    )
    ...+>
    (
    -kunmap(page);
    |
    -kunmap_atomic(ptr);
    )

    // Remove any pointers left unused
    @ depends on memset_rule2 @
    identifier memset_rule2.ptr;
    type VP, VP1;
    @@
    -VP ptr;
    ... when != ptr;
    ? VP1 ptr;

    // </smpl>

Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26  btrfs: use memcpy_[to|from]_page() and kmap_local_page()  (Ira Weiny)
There are many places where the pattern kmap/memcpy/kunmap occurs. This pattern was lifted to the core common functions memcpy_[to|from]_page(). Use these new functions to reduce the code, eliminate direct uses of kmap, and leverage the new core functions' use of kmap_local_page().

Also, there is 1 place where a kmap/memcpy is followed by an optional memset. Here we leave the kmap open coded to avoid remapping the page, but use kmap_local_page() directly.

Development of this patch was aided by the coccinelle script:

    // <smpl>
    // SPDX-License-Identifier: GPL-2.0-only
    // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
    //
    // NOTE: Offsets and other expressions may be more complex than what the script
    // will automatically generate. Therefore a catchall rule is provided to find
    // the pattern which then must be evaluated by hand.
    //
    // Confidence: Low
    // Copyright: (C) 2021 Intel Corporation
    // URL: http://coccinelle.lip6.fr/
    // Comments:
    // Options:
    //
    // simple memcpy version
    //
    @ memcpy_rule1 @
    expression page, T, F, B, Off;
    identifier ptr;
    type VP;
    @@
    (
    -VP ptr = kmap(page);
    |
    -ptr = kmap(page);
    |
    -VP ptr = kmap_atomic(page);
    |
    -ptr = kmap_atomic(page);
    )
    <+...
    (
    -memcpy(ptr + Off, F, B);
    +memcpy_to_page(page, Off, F, B);
    |
    -memcpy(ptr, F, B);
    +memcpy_to_page(page, 0, F, B);
    |
    -memcpy(T, ptr + Off, B);
    +memcpy_from_page(T, page, Off, B);
    |
    -memcpy(T, ptr, B);
    +memcpy_from_page(T, page, 0, B);
    )
    ...+>
    (
    -kunmap(page);
    |
    -kunmap_atomic(ptr);
    )

    // Remove any pointers left unused
    @ depends on memcpy_rule1 @
    identifier memcpy_rule1.ptr;
    type VP, VP1;
    @@
    -VP ptr;
    ... when != ptr;
    ? VP1 ptr;

    //
    // Some callers kmap without a temp pointer
    //
    @ memcpy_rule2 @
    expression page, T, Off, F, B;
    @@
    <+...
    (
    -memcpy(kmap(page) + Off, F, B);
    +memcpy_to_page(page, Off, F, B);
    |
    -memcpy(kmap(page), F, B);
    +memcpy_to_page(page, 0, F, B);
    |
    -memcpy(T, kmap(page) + Off, B);
    +memcpy_from_page(T, page, Off, B);
    |
    -memcpy(T, kmap(page), B);
    +memcpy_from_page(T, page, 0, B);
    )
    ...+>
    -kunmap(page);
    // No need for the ptr variable removal

    //
    // Catch all
    //
    @ memcpy_rule3 @
    expression page;
    expression GenTo, GenFrom, GenSize;
    identifier ptr;
    type VP;
    @@
    (
    -VP ptr = kmap(page);
    |
    -ptr = kmap(page);
    |
    -VP ptr = kmap_atomic(page);
    |
    -ptr = kmap_atomic(page);
    )
    <+...
    (
    //
    // Some call sites have complex expressions within the memcpy
    // match a catch all to be evaluated by hand.
    //
    -memcpy(GenTo, GenFrom, GenSize);
    +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
    +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
    )
    ...+>
    (
    -kunmap(page);
    |
    -kunmap_atomic(ptr);
    )

    // Remove any pointers left unused
    @ depends on memcpy_rule3 @
    identifier memcpy_rule3.ptr;
    type VP, VP1;
    @@
    -VP ptr;
    ... when != ptr;
    ? VP1 ptr;

    // <smpl>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18  btrfs: compression: inline free_workspace  (David Sterba)
Replace indirect calls to free_workspace by a switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to Spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18  btrfs: compression: inline alloc_workspace  (David Sterba)
Replace indirect calls to alloc_workspace by a switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to Spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18  btrfs: compression: inline put_workspace  (David Sterba)
Similar to get_workspace, the majority of the callbacks are trivial; we don't gain anything by the indirection, so replace them with a switch function. Trivial callback implementations use the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18  btrfs: compression: inline get_workspace  (David Sterba)
The majority of the callbacks are trivial; we don't gain anything by the indirection, so replace them with a switch function. ZLIB needs to adjust the level in the callback and ZSTD workspace management is complex; the rest call the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18  btrfs: compression: export alloc/free/get/put callbacks of all algos  (David Sterba)
The indirect calls will be replaced by a switch in compression.c. (A switch is faster than indirect calls when Spectre mitigations are enabled.)

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
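A sketch of the switch-based dispatch this series moves to (simplified; the per-algorithm helpers shown are the ones named in this series, the surrounding signature is illustrative):

    static struct list_head *get_workspace_example(int type, int level)
    {
            switch (type) {
            case BTRFS_COMPRESS_NONE:
            case BTRFS_COMPRESS_LZO:
                    return btrfs_get_workspace(type, level);   /* generic helper */
            case BTRFS_COMPRESS_ZLIB:
                    return zlib_get_workspace(level);
            case BTRFS_COMPRESS_ZSTD:
                    return zstd_get_workspace(level);
            default:
                    /* Compression type was validated earlier. */
                    ASSERT(0);
                    return NULL;
            }
    }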