.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device.  The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with a CRC and sequence
  numbers.
* *Multi-tree indexing* (indexing trees sharded by logical address) for high
  pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.


Constructor
===========

::

    pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``               Any DAX-capable block device (``/dev/pmem0``…).
                            All metadata *and* cached blocks are stored here.

``backing_dev``             The slow block device to be cached.

``cache_mode``              Optional. Only ``writeback`` is accepted at the
                            moment.

``data_crc``                Optional; defaults to ``false``.

                            * ``true``  – store a CRC32 for every cached entry
                              and verify it on reads
                            * ``false`` – skip CRC (faster)
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
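
In the example above, ``4`` is the number of words that follow it
(``cache_mode writeback data_crc true``).  The sketch below derives that count
automatically.  ``pcache_table`` is a hypothetical helper written for this
document, not part of dmsetup, and it assumes at least one key/value pair is
passed.

.. code-block:: shell

   # Hypothetical helper: print a dm-pcache table line.
   # Usage: pcache_table <sectors> <cache_dev> <backing_dev> <key> <value>...
   pcache_table() {
       sectors=$1; cache_dev=$2; backing_dev=$3
       shift 3
       # "$#" is now the number of optional words, "$*" the words themselves.
       printf '0 %s pcache %s %s %s %s\n' \
           "$sectors" "$cache_dev" "$backing_dev" "$#" "$*"
   }

   # The sector count is a sample value; on a real system it would come
   # from "blockdev --getsz /dev/sdb".
   pcache_table 2097152 /dev/pmem0 /dev/sdb cache_mode writeback data_crc true

The printed line can be passed to ``dmsetup create <name> --table``.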


Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  * Bit 0 – DATA_CRC enabled
                                 * Bit 1 – INIT_DONE (cache initialised)
                                 * Bits 2-5 – cache mode (0 == WB)

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================
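
As a rough illustration of the fields above, the following sketch splits one
status line and decodes ``cache_flags``.  The sample line is made up, and the
sketch assumes the flags are printed in a form that shell arithmetic can
parse (a bare hexadecimal value would need a ``0x`` prefix).  ``dmsetup
status`` prefixes the target-specific fields with ``<start> <length>
<target>``, which the sketch drops first.

.. code-block:: shell

   # Sample line only: the numbers are invented for illustration.
   status='0 2097152 pcache 0 64 62 10 70 3 5:4096 3:8192 1:0'

   # Split the line on whitespace into positional parameters.
   set -- $status
   shift 3    # drop <start> <length> <target> printed by dmsetup

   sb_flags=$1 seg_total=$2 cache_segs=$3 segs_used=$4
   gc_percent=$5 cache_flags=$6
   key_head=$7 dirty_tail=$8 key_tail=$9

   # Decode cache_flags according to the table above.
   data_crc=$(( cache_flags & 1 ))
   init_done=$(( (cache_flags >> 1) & 1 ))
   cache_mode=$(( (cache_flags >> 2) & 15 ))

   echo "segments used: $segs_used of $cache_segs (GC at $gc_percent%)"
   echo "data_crc=$data_crc init_done=$init_done cache_mode=$cache_mode"
   echo "key_head segment ${key_head%%:*}, offset ${key_head#*:}"

On a live system the sample line would be replaced by
``$(dmsetup status <device>)``.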


Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
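
A script that sets the threshold may want to check the documented 0-90 range
before sending the message.  The wrapper below is a minimal sketch;
``set_gc_percent`` and the ``DRY_RUN`` variable are names invented for this
example.

.. code-block:: shell

   # Hypothetical wrapper: reject values outside 0-90 before calling dmsetup.
   # With DRY_RUN set to a non-empty value the command is printed, not run.
   set_gc_percent() {
       dev=$1; pct=$2
       case $pct in
           ''|*[!0-9]*)
               echo "gc_percent must be a non-negative integer" >&2
               return 1 ;;
       esac
       if [ "$pct" -gt 90 ]; then
           echo "gc_percent must be in the range 0-90" >&2
           return 1
       fi
       ${DRY_RUN:+echo} dmsetup message "$dev" 0 gc_percent "$pct"
   }

   DRY_RUN=1
   set_gc_percent pcache_sdb 80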


Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
backing_dev             Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev               DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* *key_tail* and *dirty_tail* delimit the live/dead and clean/dirty ksets,
  respectively.
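
The 16 MiB segment size gives a quick upper bound on how many segments a pmem
device can hold.  The sketch below is arithmetic only; ``max_segments`` is a
hypothetical helper, and the ``seg_total`` that dm-pcache reports may be lower
because the metadata is stored on the same device.

.. code-block:: shell

   SEG_SIZE=$(( 16 * 1024 * 1024 ))   # 16 MiB, as stated above

   # Upper bound on segments for a pmem device of the given size in bytes.
   max_segments() {
       echo $(( $1 / SEG_SIZE ))
   }

   # Sample size of 4 GiB; on a real system the size would come from
   # "blockdev --getsize64 /dev/pmem0".
   max_segments $(( 4 * 1024 * 1024 * 1024 ))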

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*.  A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``.  It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
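
The trigger condition can be checked by hand against the values reported by
``dmsetup status``.  The sketch below assumes integer arithmetic (the division
truncates); ``gc_should_run`` is a hypothetical helper and the numbers are
samples.

.. code-block:: shell

   # gc_should_run <segs_used> <seg_total> <gc_percent>
   # Succeeds (exit status 0) when segs_used >= seg_total * gc_percent / 100.
   gc_should_run() {
       [ "$1" -ge $(( $2 * $3 / 100 )) ]
   }

   # With 64 segments and gc_percent 70 the threshold is 64 * 70 / 100 = 44.
   if gc_should_run 45 64 70; then echo "GC would start"; fi
   if ! gc_should_run 43 64 70; then echo "below threshold"; fi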

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key.  Reads
validate the CRC before copying to the caller.


Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.


Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC, ...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.


Example workflow
================

.. code-block:: shell

   # 1.  Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2.  Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3.  Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4.  Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5.  Shut down
   umount /mnt
   dmsetup remove pcache_sdb


``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!