Configuration Reference#
This page documents every CLI argument accepted by the LMCache multiprocess server. Arguments are grouped by the config module that defines them.
MP Server#
Source: lmcache/v1/multiprocess/config.py
| Argument | Default | Description |
|---|---|---|
| | (unset, default UUID v4) | Stable identity of this MP server. Used as the coordinator membership key and projected onto the OTel |
| | | Host address to bind the ZMQ server. |
| | | Port to bind the ZMQ server. |
| | | Chunk size for KV cache operations (in tokens). |
| | | Base number of worker threads. Sets the default for both the GPU (affinity) pool and the CPU (normal) pool. Can be overridden per-pool with |
| | (inherits | Worker threads for the GPU affinity pool (STORE/RETRIEVE). Requests from the same vLLM instance are always dispatched to the same thread, eliminating GPU transfer lock contention. |
| | (inherits | Worker threads for the normal CPU pool (LOOKUP, etc.). |
| | | Hash algorithm for token-based operations. Choices: |
| | | Cache engine backend type. |
| | | Which worker → server transfer paths the server loads. |
| | | Zero or more paths to runtime plugin scripts or directories to launch alongside the server. Plugins are spawned by |
| | | JSON string of extra key-value config forwarded to runtime plugins via |
| | | Space-separated list of Python module names that scripts posted to the HTTP |
| | (not set) | SHM segment name for non-GPU KV transfer (only used when the non-GPU path is loaded, i.e. |
| | | Silence budget (seconds) after which a worker that has sent at least one heartbeat PING but then gone quiet has its KV cache registration reaped, freeing the leaked GPU context and CUDA IPC handles. |
| | | Silence budget (seconds) for a worker that registered but has never sent a PING (still warming up, or died before its first request). Must be >= |
| | | CacheBlend ( |
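The worker-pool rows above describe two behaviours: each pool falls back to the base thread count when its own size is unset, and requests from one vLLM instance always land on the same GPU-pool thread. The sketch below illustrates both under stated assumptions. The function names, parameter names, and the CRC32-modulo mapping are hypothetical and are not taken from the LMCache source; the server may pick threads differently.

```python
import zlib
from typing import Optional, Tuple


def resolve_pool_sizes(
    max_workers: int,
    gpu_workers: Optional[int] = None,
    cpu_workers: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (gpu_pool_size, cpu_pool_size); an unset pool inherits the base value."""
    gpu = max_workers if gpu_workers is None else gpu_workers
    cpu = max_workers if cpu_workers is None else cpu_workers
    return gpu, cpu


def affinity_thread(instance_id: str, gpu_pool_size: int) -> int:
    """Map an instance ID to a fixed thread index (hypothetical scheme)."""
    # crc32 is stable across processes, unlike the built-in hash() on str,
    # so the same instance always maps to the same index.
    return zlib.crc32(instance_id.encode("utf-8")) % gpu_pool_size


# Values mirror the full example on this page: base 4, GPU pool overridden to 2.
gpu_pool, cpu_pool = resolve_pool_sizes(max_workers=4, gpu_workers=2)
print(gpu_pool, cpu_pool)
print(affinity_thread("vllm-0", gpu_pool))
```

Pinning an instance to one thread serialises its transfers without an explicit lock, which is the contention benefit the table describes.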
Lookup Hash Logging#
Source: lmcache/v1/mp_observability/subscribers/logging/lookup_hash.py
When enabled, the server publishes chunk hashes computed during lookup()
as MP_LOOKUP events on the EventBus. The
LookupHashLoggingSubscriber writes these to rotating JSONL files for
offline analysis. Disabled by default. These arguments are part of the
Observability group.
| Argument | Default | Description |
|---|---|---|
| | | Directory to write lookup hash JSONL files. An empty string disables logging. |
| | | Time interval in seconds before rotating to a new log file. |
| | | Max file size in bytes before rotating even if the time interval has not elapsed. |
| | | Max number of log files to keep. Oldest files are deleted when this limit is exceeded. |
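Taken together, the three rotation settings bound how much disk the lookup hash logs can use and how far back they can reach. The sketch below works through that arithmetic with the values from the full example on this page (21600 s, 104857600 bytes, 100 files). It assumes each retained file stays at or below the size cap; a file may overshoot slightly by its final record. The function name is hypothetical.

```python
def lookup_log_budget(rotation_interval_s: int, max_size_bytes: int, max_files: int) -> dict:
    """Upper bounds implied by the rotation settings."""
    return {
        # Every retained file is at most max_size_bytes.
        "max_disk_bytes": max_size_bytes * max_files,
        # Longest history is reached when rotation is driven by time alone;
        # size-triggered rotation shortens it.
        "max_history_s": rotation_interval_s * max_files,
    }


budget = lookup_log_budget(21600, 104857600, 100)
print(budget["max_disk_bytes"] / 2**30, "GiB")
print(budget["max_history_s"] / 86400, "days")
```

With these values the logs occupy at most about 9.8 GiB and cover at most 25 days.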
HTTP Frontend#
Source: lmcache/v1/multiprocess/config.py
The HTTP frontend is included when running lmcache server.
| Argument | Default | Description |
|---|---|---|
| | | Host to bind the HTTP (FastAPI/uvicorn) server. |
| | | Port to bind the HTTP server. |
L1 Memory Manager#
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| | required | Size of the L1 tier in GB. Sizes the pinned-DRAM L1 by default, or the GDS slab file when |
| | | Enable or disable lazy allocation for L1 memory. Pass |
| | | Initial allocation size (GB) when using lazy allocation. |
| | | Alignment size in bytes (default 4 KB). |
| | (not set) | Optional |
GDS L1 Tier#
Source: lmcache/v1/distributed/config.py
Opt-in. Setting --gds-l1-path switches the L1 medium from pinned DRAM to
an NVMe slab file accessed via GPUDirect Storage (cuFile DMA). The CPU
pinned-DRAM tier is then disabled, and --l1-size-gb sizes the slab.
Disable byte-array L2 adapters when this is on (the GDS tier exposes no L1
memory buffer for them to register).
| Argument | Default | Description |
|---|---|---|
| | Not set | NVMe directory for the GDS L1 slab. Setting this enables the GDS L1 tier; one shared slab per process lives at |
| | | Open the slab with |
L1 Manager TTLs#
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| | | Time-to-live for each object’s write lock (seconds). |
| | | Time-to-live for each object’s read lock (seconds). |
Eviction Policy#
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| | required | Eviction policy. Choices: |
| | | Memory usage ratio (0.0–1.0) that triggers eviction. |
| | | Fraction of allocated memory to evict when triggered (0.0–1.0). |
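The watermark and the eviction ratio interact, so a worked calculation helps. The sketch below uses the values from the full example on this page (100 GB L1, watermark 0.9, ratio 0.1). It assumes eviction triggers once usage reaches watermark × L1 size, and that "allocated memory" means the bytes in use at that moment. Both are readings of the table, not confirmed behaviour, and the function name is hypothetical.

```python
def eviction_plan(l1_size_gb: float, trigger_watermark: float,
                  eviction_ratio: float, used_gb: float) -> float:
    """Return how many GB would be evicted at the given usage (0.0 if below the watermark)."""
    trigger_gb = l1_size_gb * trigger_watermark
    if used_gb < trigger_gb:
        return 0.0
    # Assumption: the ratio applies to memory currently in use.
    return used_gb * eviction_ratio


print(eviction_plan(100, 0.9, 0.1, used_gb=80))  # below the watermark
print(eviction_plan(100, 0.9, 0.1, used_gb=90))  # at the watermark
```

Under these assumptions, nothing is evicted at 80 GB, and reaching 90 GB frees about 9 GB, leaving usage near 81 GB.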
L2 Policies#
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| | | L2 store policy. Determines which adapters receive each key and whether keys are deleted from L1 after L2 store. The |
| | | L2 prefetch policy. Determines which adapter loads each key when multiple adapters have it. The |
| | | Maximum number of concurrent prefetch (L2 load) requests. Limits how many in-flight loads the PrefetchController may issue at once, preventing excessive L1 memory pressure. |
| | | Interval in milliseconds for the periodic event notifier heartbeat. A native C++ background thread writes to all registered file descriptors at this interval, waking controller poll loops for L2 adapters that lack native async completion callbacks. |
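The in-flight limit on prefetch requests is a form of bounded concurrency. The sketch below shows the general technique with an `asyncio.Semaphore`. It is not the PrefetchController's implementation: `run_bounded` and `fake_load` are hypothetical stand-ins, and the limit of 8 is taken from the full example on this page.

```python
import asyncio


async def run_bounded(keys, max_in_flight, load):
    """Run load(key) for every key, never exceeding max_in_flight concurrent loads."""
    sem = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def one(key):
        nonlocal in_flight, peak
        async with sem:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await load(key)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(one(k) for k in keys))
    return results, peak


async def fake_load(key):
    """Hypothetical stand-in for an L2 load."""
    await asyncio.sleep(0.01)
    return f"loaded:{key}"


results, peak = asyncio.run(run_bounded(range(20), 8, fake_load))
print(len(results), peak)
```

Capping concurrent loads caps how many L1 buffers can be reserved for incoming data at once, which is the memory-pressure concern the table mentions.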
L2 Adapters#
Source: lmcache/v1/distributed/l2_adapters/config.py
L2 adapters are configured via repeatable --l2-adapter <JSON> arguments.
Each JSON object must include a "type" field that selects the adapter type.
The order of --l2-adapter arguments determines the adapter order (cascade).
Registered adapter types: nixl_store, nixl_store_dynamic, fs,
fs_native, mock, mooncake_store, aerospike, s3, resp,
plugin, native_plugin, raw_block, dax.
Each adapter type’s required and optional fields, plus per-backend examples, are
documented on its own page under Secondary KV Storage
– including the adapters not detailed inline here (fs_native,
raw_block, dax, mooncake_store, hfbucket, resp).
aerospike – Aerospike native connector#
Native C++ Aerospike L2 adapter (optional; build with BUILD_AEROSPIKE=1).
See Secondary KV Storage for build prerequisites and the full field list.
Fields:
- hosts (required): Seed hosts, as host:port[,host:port...].
- namespace (optional, default "lmcache"): Aerospike namespace.
- set_name / set (optional, default "kv_chunks"): Aerospike set.
- num_workers (optional, default 8): C++ I/O worker threads.
- read_timeout_ms / write_timeout_ms (optional): Client timeouts.
- default_ttl_seconds (optional, default 86400): Record TTL (0 = namespace default).
- target_segment_bytes / max_record_bytes (optional, default 0): Shard target and record-cap override (0 = auto-discover).
- username / password (optional): Enterprise Edition auth.
- max_capacity_gb (optional, default 0): L2 capacity for eviction (0 disables tracking).
Example:
--l2-adapter '{"type": "aerospike", "hosts": "127.0.0.1:3000", "namespace": "lmcache", "set_name": "kv_chunks", "num_workers": 8}'
Multiple adapters (cascade)#
Pass --l2-adapter multiple times. Adapters are used in the order given:
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/ssd/l2", "use_direct_io": "false"}, "pool_size": 64}' \
--l2-adapter '{"type": "nixl_store", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true"}, "pool_size": 128}'
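Hand-quoting JSON inside shell arguments is error-prone, so a launcher script may prefer to generate the repeated --l2-adapter arguments. The sketch below is one way to do that with the Python standard library. The helper name `l2_adapter_args` is hypothetical; the adapter fields are copied from the cascade example above.

```python
import json
import shlex


def l2_adapter_args(adapters):
    """Turn adapter dicts into repeated --l2-adapter <JSON> arguments, preserving order."""
    args = []
    for spec in adapters:
        if "type" not in spec:
            raise ValueError('each adapter object must include a "type" field')
        args += ["--l2-adapter", json.dumps(spec)]
    return args


adapters = [
    {"type": "nixl_store", "backend": "POSIX",
     "backend_params": {"file_path": "/data/ssd/l2", "use_direct_io": "false"},
     "pool_size": 64},
    {"type": "nixl_store", "backend": "GDS",
     "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true"},
     "pool_size": 128},
]

argv = ["lmcache", "server", *l2_adapter_args(adapters)]
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Because list order is preserved, the first dict becomes the first adapter in the cascade. Passing `argv` directly to `subprocess.run` avoids shell quoting altogether.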
Observability#
Source: lmcache/v1/mp_observability/config.py
See Observability for full details on the three modes (metrics, logging, tracing).
| Argument | Default | Description |
|---|---|---|
| | off | Master switch: disable the EventBus entirely. |
| | off | Skip metrics subscribers (no Prometheus endpoint). |
| | off | Skip logging subscribers. |
| | off | Register tracing subscribers. Requires |
| | | Max events in the EventBus queue before tail-drop. |
| | (none) | OTLP gRPC endpoint for exporting metrics and traces. |
| | | Port for the Prometheus |
vLLM Client Configuration#
On the vLLM side, specify the LMCache server host and port via the
kv_connector_extra_config parameter:
vllm serve Qwen/Qwen3-14B \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"lmcache.mp.host": "127.0.0.1", "lmcache.mp.port": 6000}}'
LMCacheMPConnector reads the following keys from
kv_connector_extra_config:
Connector extra_config Keys#
All connector-level options are passed through
kv_connector_extra_config and use the lmcache.mp. prefix.
| Key | Default | Description |
|---|---|---|
| | | Host (with ZMQ transport prefix) of the LMCache MP server. |
| | | Port of the LMCache MP server. Must match the server’s |
| | | Timeout (seconds) for blocking message-queue requests, including the initial chunk-size query and KV cache registration/unregistration. If the server does not respond within this window, the connector raises |
| | | Interval (seconds) between periodic heartbeat pings sent from the connector to the server. |
| | | Routing mode for the worker -> server transfer context. One of |
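The --kv-transfer-config value is a single JSON string, which is easy to mistype by hand. The sketch below generates it from a Python dict and prints a shell-safe command line. It uses only the keys and values shown in the vllm serve example on this page; other lmcache.mp.* keys are not shown because their names do not appear here.

```python
import json
import shlex

kv_transfer_config = {
    "kv_connector": "LMCacheMPConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        # Connector-level options carry the lmcache.mp. prefix.
        "lmcache.mp.host": "127.0.0.1",
        "lmcache.mp.port": 6000,
    },
}

argv = [
    "vllm", "serve", "Qwen/Qwen3-14B",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]
# shlex.join quotes the JSON so the printed line is safe to paste into a shell.
print(shlex.join(argv))
```

Keeping the port in one variable shared with the server launch script is a simple way to ensure the two sides match.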
Environment Variables#
| Variable | Description |
|---|---|
| | Log level for LMCache ( |
| | Set to a fixed value for reproducible hashing across processes (relevant when using |
Full Example#
lmcache server \
--host 0.0.0.0 \
--port 6555 \
--chunk-size 512 \
--max-workers 4 \
--max-gpu-workers 2 \
--hash-algorithm blake3 \
--engine-type default \
--lookup-hash-log-dir /data/lmcache/lookup_hashes \
--lookup-hash-log-rotation-interval 21600 \
--lookup-hash-log-rotation-max-size 104857600 \
--lookup-hash-log-max-files 100 \
--l1-size-gb 100 \
--l1-use-lazy \
--l1-init-size-gb 20 \
--l1-align-bytes 4096 \
--l1-write-ttl-seconds 600 \
--l1-read-ttl-seconds 300 \
--eviction-policy noop \
--l2-store-policy skip_l1 \
--eviction-trigger-watermark 0.9 \
--eviction-ratio 0.1 \
--l2-prefetch-policy default \
--l2-prefetch-max-in-flight 8 \
--periodic-notifier-interval-ms 5 \
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"}, "pool_size": 64}' \
--prometheus-port 9090 \
--metrics-sample-rate 0.01 \
--enable-tracing \
--otlp-endpoint http://localhost:4317