# SPDX-License-Identifier: Apache-2.0
Device-DAX (/dev/dax)#
Overview#
The DAX storage plugin maps a /dev/dax device using mmap(MAP_SHARED)
and uses the mapped region as a fixed-size arena for KV cache chunks.
Typical /dev/dax devices include persistent memory,
CXL-attached memory, and other byte-addressable memory devices.
Data stored on the DAX device may survive process restarts, but is not guaranteed to be durable.
KV cache data is stored in the DAX region as part of the backend’s storage flow. Reads copy data back into CPU-backed memory objects.
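The mapping idea above can be illustrated with a minimal sketch. It is not the plugin's implementation: a temporary regular file stands in for the /dev/dax device, and the names SLOT_BYTES, NUM_SLOTS, store, and load are hypothetical, chosen only for this example.

```python
import mmap
import os
import tempfile

SLOT_BYTES = 4096  # hypothetical slot size for this demo
NUM_SLOTS = 4      # hypothetical arena size, in slots

# Stand-in for a /dev/dax device: a regular file with a fixed size.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, SLOT_BYTES * NUM_SLOTS)

# Map the whole region shared and read-write, as mmap(MAP_SHARED) would.
arena = mmap.mmap(
    fd,
    SLOT_BYTES * NUM_SLOTS,
    flags=mmap.MAP_SHARED,
    prot=mmap.PROT_READ | mmap.PROT_WRITE,
)


def store(slot: int, payload: bytes) -> None:
    """Copy a chunk into its fixed slot inside the mapped arena."""
    if len(payload) > SLOT_BYTES:
        raise ValueError("payload larger than one slot")
    start = slot * SLOT_BYTES
    arena[start:start + len(payload)] = payload


def load(slot: int, length: int) -> bytes:
    """Copy a chunk back out into ordinary CPU memory."""
    start = slot * SLOT_BYTES
    return bytes(arena[start:start + length])


store(2, b"kv-chunk")
restored = load(2, len(b"kv-chunk"))

# Clean up the stand-in file.
arena.close()
os.close(fd)
os.unlink(path)
```

The arena never grows: every write lands at a fixed offset inside the mapping, and every read returns a copy held in normal process memory.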
Configuration#
local_cpu: true
max_local_cpu_size: 80
storage_plugins: ["dax"]
extra_config:
  storage_plugin.dax.module_path: lmcache.v1.storage_backend.plugins.dax_backend
  storage_plugin.dax.class_name: DaxBackend
  dax.device_path: "/dev/dax1.0"
  dax.max_dax_size: 100
  dax.restore_workers: 8
  dax.restore_max_regions: 8
  dax.retrieve_staging_slab_bytes: 268435456
Multiprocess Mode#
In LMCache multiprocess mode, Device-DAX is configured as a built-in L2
adapter named dax. The MP adapter uses the normal L2 adapter
submit -> event fd -> query contract; no vLLM connector protocol changes
are required.
lmcache server \
  --l1-size-gb 80 \
  --eviction-policy LRU \
  --l2-adapter '{
    "type": "dax",
    "device_path": "/dev/dax1.0",
    "max_dax_size_gb": 100,
    "slot_bytes": 268435456,
    "num_store_workers": 1,
    "num_lookup_workers": 1,
    "num_load_workers": 4
  }'
The legacy single-device --l2-adapter JSON accepts these fields:
- device_path: required path to a readable and writable DAX device.
- max_dax_size_gb: required mapped size in GiB. The value must fit within the device capacity when capacity can be determined with fstat.
- slot_bytes: required fixed slot size in bytes. It must be large enough for one full LMCache chunk.
- num_store_workers: optional store worker count, default 1.
- num_lookup_workers: optional lookup worker count, default 1.
- num_load_workers: optional load worker count, default min(4, os.cpu_count()).
- persist_enabled: accepted by common MP L2 parsing but ignored by dax in this release.
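The field rules above can be sketched as a small parser. This is an illustration only, not LMCache's parsing code: the function name parse_dax_adapter is hypothetical, and the fallback to 1 when os.cpu_count() returns None is an assumption of this sketch.

```python
import json
import os

REQUIRED_FIELDS = ("device_path", "max_dax_size_gb", "slot_bytes")


def parse_dax_adapter(raw: str) -> dict:
    """Validate a legacy single-device adapter JSON and fill in defaults."""
    cfg = json.loads(raw)
    if cfg.get("type") != "dax":
        raise ValueError("type must be 'dax'")
    for key in REQUIRED_FIELDS:
        if key not in cfg:
            raise ValueError(f"missing required field: {key}")
    cfg.setdefault("num_store_workers", 1)
    cfg.setdefault("num_lookup_workers", 1)
    # Assumption: fall back to 1 if the CPU count cannot be determined.
    cfg.setdefault("num_load_workers", min(4, os.cpu_count() or 1))
    # persist_enabled is accepted but has no effect for dax, so drop it here.
    cfg.pop("persist_enabled", None)
    return cfg


example = parse_dax_adapter(
    '{"type": "dax", "device_path": "/dev/dax1.0",'
    ' "max_dax_size_gb": 100, "slot_bytes": 268435456,'
    ' "persist_enabled": true}'
)
```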
Runtime hotplug uses the multi-device form. The devices list may also be
empty when hotplug_enabled is true.
lmcache server \
  --l1-size-gb 80 \
  --eviction-policy LRU \
  --l2-adapter '{
    "type": "dax",
    "devices": [
      {"device_path": "/dev/daxX.X", "max_dax_size_gb": 100},
      {"device_path": "/dev/daxY.Y", "max_dax_size_gb": 100}
    ],
    "slot_bytes": 268435456,
    "hotplug_enabled": true,
    "num_store_workers": 1,
    "num_lookup_workers": 1,
    "num_load_workers": 4
  }'
MP DAX stores opaque ObjectKey values in memory and is volatile-only in
this release. Closing and reopening the server on the same DAX path starts
with an empty index, so previously written bytes are not discoverable after
restart.
MP DAX uses one stable adapter facade per LMCache server. The facade owns
stable event fds and worker pools, and runtime add/remove/resize only changes
the mapped DAX cores behind that facade. It does not add kernel-level CXL or
DAX reconfiguration, per-TP DAX partitions, on-device metadata, or restart
recovery. Capacity accounting and eviction are slot-based: a stored object
occupies one slot even if its payload is smaller than slot_bytes.
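Slot-based accounting can be made concrete with a little arithmetic. The helper names below are hypothetical, and rounding the slot count down when the mapped size is not a multiple of slot_bytes is an assumption of this sketch.

```python
GIB = 1024 ** 3


def slot_count(max_dax_size_gb: int, slot_bytes: int) -> int:
    """Number of whole slots that fit in the mapped region."""
    return (max_dax_size_gb * GIB) // slot_bytes


def used_bytes(num_objects: int, slot_bytes: int) -> int:
    """Capacity charged for stored objects: one full slot each."""
    return num_objects * slot_bytes


# Values from the example command: 100 GiB mapped, 256 MiB slots.
slots = slot_count(100, 268435456)
charged = used_bytes(3, 268435456)
```

With these values the region holds 400 slots, and three stored objects are charged 805306368 bytes regardless of how small their payloads are.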
Runtime Hotplug API#
Runtime hotplug is disabled unless hotplug_enabled is true. The API
changes only LMCache runtime mappings and metadata; the /dev/dax* device
must already exist and be readable and writable by the LMCache server process.
The runtime endpoints are implemented through StorageManager’s generic L2
adapter reconfiguration interface, which routes backend, operation name, and
adapter-specific payload to the selected adapter. DAX owns the path, mode,
migration, and resize semantics; the generic interface is reusable by other
adapters such as P2P.
Use JSON bodies because DAX paths contain slashes:
curl http://127.0.0.1:9000/reconfigure/dax/status
curl -X POST http://127.0.0.1:9000/reconfigure/dax/add \
  -H 'Content-Type: application/json' \
  -d '{"device_path": "/dev/daxX.X", "size": "100GiB"}'
curl -X POST http://127.0.0.1:9000/reconfigure/dax/remove \
  -H 'Content-Type: application/json' \
  -d '{"device_path": "/dev/daxX.X", "mode": "migrate"}'
curl -X POST http://127.0.0.1:9000/reconfigure/dax/resize \
  -H 'Content-Type: application/json' \
  -d '{"device_path": "/dev/daxX.X", "size": "200GiB"}'
size is required for add and resize. Use an integer byte count or a string
such as "100GiB". remove supports these modes:
- migrate: move DAX-resident KV to other active DAX devices before closing the source device.
- evict: delete DAX-resident KV on the source device. This is destructive for the DAX tier.
- drain: stop new writes to the source device and leave existing KV readable until it is evicted or the server closes.
resize supports migrate and evict modes. It does not support
drain because resize completes synchronously.
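A client can check a request against the size and mode rules above before sending it. The sketch below is a hypothetical client-side helper, not part of LMCache. It only understands integer byte counts and the "GiB" suffix shown in this document, and it assumes that resize takes its mode in the same "mode" field that remove uses.

```python
import json
import re

GIB = 1024 ** 3
ALLOWED_MODES = {
    "remove": {"migrate", "evict", "drain"},
    "resize": {"migrate", "evict"},  # drain is not supported for resize
}


def size_to_bytes(size) -> int:
    """Accept an integer byte count or a string such as "100GiB"."""
    if isinstance(size, int) and not isinstance(size, bool):
        return size
    match = re.fullmatch(r"(\d+)GiB", str(size))
    if match is None:
        raise ValueError(f"unsupported size in this sketch: {size!r}")
    return int(match.group(1)) * GIB


def build_body(op: str, device_path: str, size=None, mode=None) -> str:
    """Build the JSON body for an add, remove, or resize request."""
    body = {"device_path": device_path}
    if op in ("add", "resize"):
        if size is None:
            raise ValueError(f"size is required for {op}")
        size_to_bytes(size)  # validate locally, send the value unchanged
        body["size"] = size
    if mode is not None:
        if mode not in ALLOWED_MODES.get(op, set()):
            raise ValueError(f"mode {mode!r} is not valid for {op}")
        body["mode"] = mode
    return json.dumps(body)


remove_body = build_body("remove", "/dev/daxX.X", mode="migrate")
```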
Hotplug operations are lock-safe by default. A remove or shrink that would
delete externally locked or borrowed slots returns 409 Conflict unless
force is set. A migration that has no active destination capacity returns
507 Insufficient Storage. Resize grow preserves the in-memory key index and
does not move KV payloads. Resize shrink never silently drops keys; entries
outside the new slot range must migrate first, or the request fails.
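The safety rules for a shrink in migrate mode can be modeled as a small decision function. This is a simplified model of the documented behavior, not the adapter's code: HotplugError, plan_shrink, and all parameter names are hypothetical, and treating "fewer free destination slots than entries to move" as the 507 case is an assumption.

```python
class HotplugError(Exception):
    """Carries the HTTP status the text associates with each failure."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status


def plan_shrink(occupied, locked, new_slots, free_destination_slots, force=False):
    """Return the slots that must migrate before the shrink, or raise."""
    # Entries at or beyond the new slot count fall outside the new range.
    outside = sorted(s for s in occupied if s >= new_slots)
    blocked = [s for s in outside if s in locked]
    if blocked and not force:
        raise HotplugError(409, f"locked or borrowed slots: {blocked}")
    if len(outside) > free_destination_slots:
        raise HotplugError(507, "no destination capacity for migration")
    return outside


to_move = plan_shrink({0, 1, 5, 6}, locked=set(), new_slots=4,
                      free_destination_slots=2)
```

In this model no key is dropped silently: the call either returns the full list of entries to migrate or fails with the matching status.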
Hardware Validation Flow#
Use the same Qwen 8B or 14B long-context workload before and after a runtime
capacity change. Without hotplug support, /reconfigure/dax/status and
/reconfigure/dax/add are not available; changing the DAX device set
requires restarting LMCache with a new --l2-adapter value, which drops the
volatile DAX key index.
export MODEL=Qwen/Qwen3-8B # or a local Qwen 8B/14B checkpoint
curl http://127.0.0.1:9000/reconfigure/dax/status
python benchmarks/long_doc_qa/long_doc_qa.py \
  --model "$MODEL" --num-documents 1 --document-length 1024 \
  --output-len 16 --repeat-count 2 --repeat-mode tile \
  --completions --host 127.0.0.1 --port 8000 --json-output
curl -X POST http://127.0.0.1:9000/reconfigure/dax/add \
  -H 'Content-Type: application/json' \
  -d '{"device_path": "/dev/daxX.X", "size": "100GiB"}'
curl http://127.0.0.1:9000/reconfigure/dax/status
Record these fields for the comparison:
- total_capacity_bytes before and after /reconfigure/dax/add.
- total_used_bytes while the Qwen workload is running.
- Whether an LMCache restart was required.
- Whether the same cached prompt remains retrievable after the capacity change.
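The before/after comparison can be scripted once the two status responses are saved. The sketch below assumes that total_capacity_bytes and total_used_bytes appear as top-level keys in the parsed status JSON; the actual response shape is not shown in this document, and compare_status is a hypothetical helper.

```python
def compare_status(before: dict, after: dict) -> dict:
    """Summarize the capacity change between two parsed status snapshots."""
    capacity_delta = after["total_capacity_bytes"] - before["total_capacity_bytes"]
    used_delta = after["total_used_bytes"] - before["total_used_bytes"]
    return {
        "capacity_delta_bytes": capacity_delta,
        "used_delta_bytes": used_delta,
        "capacity_grew": capacity_delta > 0,
    }


# Illustrative numbers only: 100 GiB before the add, 200 GiB after.
summary = compare_status(
    {"total_capacity_bytes": 107374182400, "total_used_bytes": 268435456},
    {"total_capacity_bytes": 214748364800, "total_used_bytes": 268435456},
)
```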
Using The Batched Restore Path#
The current DAX optimization is a staged batched restore path for retrieval. It is enabled automatically whenever the DAX backend is configured. No extra feature flag is required.
The retrieve flow is:
1. Reserve a batched set of readable DAX chunks.
2. Allocate CPU restore buffers from LocalCPUBackend.
3. Copy DAX data into a backend-owned pinned staging slab in coalesced regions.
4. Copy from the staging slab into the final CPU MemoryObj outputs.
5. Upload those CPU outputs through the normal GPU connector path.
The store flow is unchanged: KV data is still staged through CPU memory before being written into the DAX arena.
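The two copy stages of the retrieve flow can be sketched in plain Python. This is a single-threaded illustration under stated assumptions, not the backend's code: chunks are modeled as (offset, length) pairs sorted by offset, a bytes object stands in for the DAX arena, a bytearray stands in for the pinned slab, and coalesce and restore are hypothetical names. The real path runs regions on worker threads and ends with a GPU upload, both omitted here.

```python
def coalesce(chunks, max_region_bytes):
    """Merge chunks that are contiguous in the arena into larger regions."""
    regions = []  # each entry: [region_offset, region_length, member_chunks]
    for off, length in chunks:
        if length > max_region_bytes:
            raise ValueError("chunk does not fit in one region")
        if regions:
            last = regions[-1]
            contiguous = last[0] + last[1] == off
            if contiguous and last[1] + length <= max_region_bytes:
                last[1] += length
                last[2].append((off, length))
                continue
        regions.append([off, length, [(off, length)]])
    return regions


def restore(arena, chunks, slab, max_region_bytes):
    """Stage each region in the slab, then cut out per-chunk CPU outputs."""
    outputs = []
    for r_off, r_len, members in coalesce(chunks, max_region_bytes):
        # Stage 1: one copy per coalesced region, arena -> staging slab.
        slab[:r_len] = arena[r_off:r_off + r_len]
        # Stage 2: slab -> final per-chunk CPU buffers.
        for off, length in members:
            start = off - r_off
            outputs.append(bytes(slab[start:start + length]))
    return outputs


arena = bytes(range(64))
chunks = [(0, 8), (8, 8), (32, 8)]
outputs = restore(arena, chunks, bytearray(16), max_region_bytes=16)
```

The first two chunks are adjacent, so they share one region and one arena read; the third chunk forms its own region.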
The new DAX tuning knobs control the batched restore path:
- dax.restore_workers: number of persistent worker threads used to execute restore regions in parallel.
- dax.restore_max_regions: maximum number of restore regions in one wave. Larger values increase parallelism but also increase slab space requirements.
- dax.retrieve_staging_slab_bytes: total size in bytes of the reusable pinned retrieve slab. This must be large enough to hold one full chunk per configured restore region.
For a first pass, start with:
- dax.restore_workers equal to the number of CPU workers you want devoted to DAX restores
- dax.restore_max_regions equal to dax.restore_workers
- dax.retrieve_staging_slab_bytes at least dax.restore_max_regions * full_chunk_size, then scale upward if larger batched restores are common
If retrieve throughput is low, increase the slab size first, then increase
worker and region counts together. If CPU pressure is high, reduce
dax.restore_workers and dax.restore_max_regions.
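The slab sizing rule is simple arithmetic, sketched below with hypothetical helper names. The full chunk size depends on the model and LMCache chunk settings and is not given in this document, so the 32 MiB figure is only what the example configuration values imply as an upper bound.

```python
MIB = 1024 ** 2


def min_slab_bytes(restore_max_regions: int, full_chunk_bytes: int) -> int:
    """Smallest slab that holds one full chunk per restore region."""
    return restore_max_regions * full_chunk_bytes


def max_chunk_bytes(slab_bytes: int, restore_max_regions: int) -> int:
    """Largest full chunk a given slab can hold for every region."""
    return slab_bytes // restore_max_regions


# Example configuration: 8 regions and a 268435456-byte (256 MiB) slab.
per_region = max_chunk_bytes(268435456, 8)
needed = min_slab_bytes(8, 32 * MIB)
```

With 8 regions, the 256 MiB slab from the example configuration leaves 32 MiB per region, so a full chunk larger than that would call for a bigger slab or fewer regions.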
Runtime Requirements#
- extra_config['dax.device_path'] is required and must point to a readable and writable DAX device.
- The process must have read-write access to the DAX device (e.g., via appropriate permissions or group membership).
- LocalCPUBackend must be enabled because DAX reads return CPU-backed memory objects.
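These requirements can be checked before starting LMCache. The sketch below is a hypothetical pre-flight helper, not an LMCache API. It operates on the configuration already parsed into a dict and assumes that local_cpu: true is the setting that enables LocalCPUBackend, as in the configuration example on this page.

```python
import os


def check_dax_config(cfg: dict) -> list:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    # Assumption: local_cpu: true is what enables LocalCPUBackend.
    if not cfg.get("local_cpu"):
        problems.append("local_cpu must be true for DAX reads")
    path = (cfg.get("extra_config") or {}).get("dax.device_path")
    if not path:
        problems.append("extra_config['dax.device_path'] is required")
    elif not os.access(path, os.R_OK | os.W_OK):
        problems.append(f"{path} is not readable and writable by this process")
    return problems


problems = check_dax_config({"local_cpu": False, "extra_config": {}})
```

os.access only reports what the current user may do, so run the check as the same user that will run the LMCache process.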
Validation and Current Limits#
- Tensor parallelism is currently limited to TP=1 (metadata.world_size == 1).
- Only single-tensor chunk layouts are supported. Multi-tensor put requests are rejected.
- Batched restore uses a backend-owned retrieve staging slab and persistent restore executors. The slab and region count can be tuned with dax.restore_workers, dax.restore_max_regions, and dax.retrieve_staging_slab_bytes.
- Blocking batched restore preserves positional output semantics, while asynchronous batched restore returns only the consecutive hit prefix.
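The difference between the two result shapes can be shown with a toy model. Representing a miss as None is an assumption of this sketch, and blocking_result and async_result are hypothetical names rather than backend functions.

```python
def blocking_result(lookups):
    """Positional semantics: output i always corresponds to key i."""
    return list(lookups)


def async_result(lookups):
    """Prefix semantics: stop at the first miss and drop everything after it."""
    prefix = []
    for item in lookups:
        if item is None:
            break
        prefix.append(item)
    return prefix


lookups = ["chunk0", "chunk1", None, "chunk3"]
blocking = blocking_result(lookups)
asynchronous = async_result(lookups)
```

In this model the blocking call keeps all four positions, including the miss, while the asynchronous call returns only the first two chunks even though the fourth key was a hit.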