# Configuration Reference
This page documents every CLI argument accepted by the LMCache multiprocess server. Arguments are grouped by the config module that defines them.
## MP Server
Source: lmcache/v1/multiprocess/config.py
| Argument | Default | Description |
|---|---|---|
| `--host` | | Host address to bind the ZMQ server. |
| `--port` | | Port to bind the ZMQ server. |
| `--chunk-size` | | Chunk size for KV cache operations (in tokens). |
| `--max-workers` | | Base number of worker threads. Sets the default for both the GPU (affinity) pool and the CPU (normal) pool; can be overridden per-pool (e.g. with `--max-gpu-workers`). |
| `--max-gpu-workers` | (inherits `--max-workers`) | Worker threads for the GPU affinity pool (STORE/RETRIEVE). Requests from the same vLLM instance are always dispatched to the same thread, eliminating GPU transfer lock contention. |
| | (inherits `--max-workers`) | Worker threads for the normal CPU pool (LOOKUP, etc.). |
| `--hash-algorithm` | | Hash algorithm for token-based operations. |
| `--engine-type` | | Cache engine backend type. |
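The per-instance affinity described for the GPU pool can be pictured as deterministic dispatch: hash each client's identity to a fixed worker index so every request from the same vLLM instance lands on the same thread. The sketch below is a hypothetical illustration of that idea, not LMCache's actual dispatch code; the function name and client-id format are made up.

```python
import hashlib

def affinity_worker_index(client_id: str, num_gpu_workers: int) -> int:
    """Map a client id to a stable worker index.

    Uses a content hash rather than Python's hash() so the mapping is
    stable across processes and runs.
    """
    digest = hashlib.sha256(client_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_gpu_workers

# All requests from one client map to one worker index, so its GPU
# transfers never contend with another client's thread for the lock.
idx = affinity_worker_index("vllm-instance-a", 2)
assert idx == affinity_worker_index("vllm-instance-a", 2)  # stable mapping
```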
## HTTP Frontend
Source: lmcache/v1/multiprocess/config.py
Only available when running `http_server.py`.

| Argument | Default | Description |
|---|---|---|
| | | Host to bind the HTTP (FastAPI/uvicorn) server. |
| | | Port to bind the HTTP server. |
## L1 Memory Manager
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| `--l1-size-gb` | required | Size of L1 memory in GB. |
| `--l1-use-lazy` | | Enable lazy allocation for L1 memory. |
| `--l1-init-size-gb` | | Initial allocation size (GB) when using lazy allocation. |
| `--l1-align-bytes` | `4096` | Alignment size in bytes (default 4 KB). |
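One plausible reading of these flags (an assumption, not LMCache's documented allocator behavior) is a grow-on-demand pool: reserve `--l1-init-size-gb` up front and grow toward `--l1-size-gb` as demand arrives, with every allocation rounded up to `--l1-align-bytes`. A toy sketch under that assumption, using doubling growth:

```python
def align_up(nbytes: int, align: int = 4096) -> int:
    """Round nbytes up to the next multiple of align (default 4 KB)."""
    return (nbytes + align - 1) // align * align

class LazyPool:
    """Toy grow-on-demand pool: reserve init_gb up front, double toward max_gb.

    The doubling policy is illustrative only; the real allocator may grow
    differently.
    """
    def __init__(self, max_gb: float, init_gb: float):
        self.max_bytes = int(max_gb * 2**30)
        self.reserved = align_up(int(init_gb * 2**30))
        self.used = 0

    def alloc(self, nbytes: int) -> None:
        nbytes = align_up(nbytes)
        while self.used + nbytes > self.reserved:
            if self.reserved >= self.max_bytes:
                raise MemoryError("L1 pool exhausted")
            self.reserved = min(self.reserved * 2, self.max_bytes)
        self.used += nbytes

# Mirrors --l1-size-gb 100 --l1-use-lazy --l1-init-size-gb 20.
pool = LazyPool(max_gb=100, init_gb=20)
pool.alloc(30 * 2**30)  # exceeds the 20 GB reservation, forcing one doubling
assert pool.reserved == 40 * 2**30
```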
## L1 Manager TTLs
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| `--l1-write-ttl-seconds` | | Time-to-live for each object's write lock (seconds). |
| `--l1-read-ttl-seconds` | | Time-to-live for each object's read lock (seconds). |
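A lock TTL like this typically exists so that a holder that crashes or stalls cannot block an object forever: once the TTL elapses, the lock becomes acquirable again. A minimal sketch of that semantics (illustrative only; the class and method names are made up, not LMCache's lock API):

```python
class TTLLock:
    """Toy per-object lock with a time-to-live: a stale holder stops
    blocking others once ttl_seconds have elapsed since acquisition."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.acquired_at = None  # timestamp of the current holder, if any

    def try_acquire(self, now: float) -> bool:
        # Free, or held past its TTL: grant the lock to the caller.
        if self.acquired_at is None or now - self.acquired_at >= self.ttl:
            self.acquired_at = now
            return True
        return False

lock = TTLLock(ttl_seconds=600)          # e.g. --l1-write-ttl-seconds 600
assert lock.try_acquire(now=0.0)         # first holder wins
assert not lock.try_acquire(now=10.0)    # still held, TTL not elapsed
assert lock.try_acquire(now=600.0)       # TTL expired; lock is recoverable
```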
## Eviction Policy
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| `--eviction-policy` | required | Eviction policy. |
| `--eviction-trigger-watermark` | | Memory usage ratio (0.0–1.0) that triggers eviction. |
| `--eviction-ratio` | | Fraction of allocated memory to evict when triggered (0.0–1.0). |
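How the two tuning knobs likely interact can be sketched as simple arithmetic, under the assumption that eviction triggers once usage crosses the watermark and then frees `eviction-ratio` of the currently allocated memory (a reading of the flag descriptions above, not LMCache's exact accounting):

```python
def eviction_plan(allocated_bytes: int, capacity_bytes: int,
                  watermark: float = 0.9, ratio: float = 0.1) -> int:
    """Return how many bytes to evict: zero below the watermark,
    otherwise `ratio` of the currently allocated memory."""
    if allocated_bytes / capacity_bytes < watermark:
        return 0
    return int(allocated_bytes * ratio)

cap = 100 * 2**30                              # --l1-size-gb 100
assert eviction_plan(80 * 2**30, cap) == 0     # 0.8 usage: below 0.9 watermark
assert eviction_plan(95 * 2**30, cap) == int(95 * 2**30 * 0.1)  # evict 10%
```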
## L2 Policies
Source: lmcache/v1/distributed/config.py
| Argument | Default | Description |
|---|---|---|
| `--l2-store-policy` | | L2 store policy. Determines which adapters receive each key and whether keys are deleted from L1 after an L2 store. |
| `--l2-prefetch-policy` | | L2 prefetch policy. Determines which adapter loads each key when multiple adapters have it. |
## L2 Adapters
Source: lmcache/v1/distributed/l2_adapters/config.py
L2 adapters are configured via repeatable `--l2-adapter <JSON>` arguments.
Each JSON object must include a `"type"` field that selects the adapter type.
The order of `--l2-adapter` arguments determines the adapter order (cascade).
Registered adapter types: `nixl_store`, `fs`, `mock`.
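The repeatable-JSON convention can be consumed on the Python side with argparse's `action="append"` plus `json.loads`. This is a sketch of that pattern, not LMCache's actual parser; the `REGISTERED` set simply mirrors the adapter types listed above.

```python
import argparse
import json

REGISTERED = {"nixl_store", "fs", "mock"}  # adapter types listed above

parser = argparse.ArgumentParser()
# action="append" collects every occurrence of --l2-adapter, in order.
parser.add_argument("--l2-adapter", action="append", default=[],
                    dest="l2_adapters", metavar="JSON")

args = parser.parse_args([
    "--l2-adapter", '{"type": "fs", "base_path": "/data/lmcache/l2"}',
    "--l2-adapter", '{"type": "mock", "max_size_gb": 256, "mock_bandwidth_gb": 10}',
])

# CLI order is preserved, which is what gives the cascade order.
adapters = [json.loads(s) for s in args.l2_adapters]
for cfg in adapters:
    assert cfg["type"] in REGISTERED, f"unknown adapter type: {cfg['type']}"
assert [a["type"] for a in adapters] == ["fs", "mock"]
```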
### nixl_store – NIXL-based persistent storage
Fields:
- `backend` (required): One of `POSIX`, `GDS`, `GDS_MT`, `HF3FS`, `OBJ`.
- `backend_params` (required for file-based backends): Dict of string key-value pairs. File-based backends (`GDS`, `GDS_MT`, `POSIX`, `HF3FS`) require `file_path` and `use_direct_io`.
- `pool_size` (required): Number of storage descriptors to pre-allocate (> 0).
Examples:
```shell
# POSIX backend (local file system)
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"}, "pool_size": 64}'

# GDS backend (GPU Direct Storage)
--l2-adapter '{"type": "nixl_store", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/lmcache", "use_direct_io": "true"}, "pool_size": 128}'

# GDS_MT backend (multi-threaded GDS)
--l2-adapter '{"type": "nixl_store", "backend": "GDS_MT", "backend_params": {"file_path": "/data/nvme/lmcache", "use_direct_io": "true"}, "pool_size": 128}'

# HF3FS backend (shared file system)
--l2-adapter '{"type": "nixl_store", "backend": "HF3FS", "backend_params": {"file_path": "/mnt/hf3fs/lmcache", "use_direct_io": "false"}, "pool_size": 64}'

# OBJ backend (object store -- no file_path needed)
--l2-adapter '{"type": "nixl_store", "backend": "OBJ", "backend_params": {}, "pool_size": 32}'
```
### fs – File-system backed storage
A pure file-system L2 adapter using async I/O.
Fields:
- `base_path` (required): Directory for storing KV cache files.
- `relative_tmp_dir` (optional): Relative sub-dir for temp files.
- `read_ahead_size` (optional): Trigger read-ahead by reading this many bytes first.
- `use_odirect` (optional): Bypass the page cache via `O_DIRECT` (default `false`).
Examples:
```shell
# Basic FS adapter
--l2-adapter '{"type": "fs", "base_path": "/data/lmcache/l2"}'

# With temp directory
--l2-adapter '{"type": "fs", "base_path": "/data/lmcache/l2", "relative_tmp_dir": ".tmp"}'
```
### mock – Mock adapter for testing
Fields:
- `max_size_gb` (required): Maximum size of the adapter in GB (> 0).
- `mock_bandwidth_gb` (required): Simulated bandwidth in GB/sec (> 0).
Example:
```shell
--l2-adapter '{"type": "mock", "max_size_gb": 256, "mock_bandwidth_gb": 10}'
```
### Multiple adapters (cascade)
Pass `--l2-adapter` multiple times. Adapters are used in the order given:

```shell
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/ssd/l2", "use_direct_io": "false"}, "pool_size": 64}' \
--l2-adapter '{"type": "nixl_store", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true"}, "pool_size": 128}'
```
## Observability
Source: lmcache/v1/mp_observability/config.py
See Observability for full details on the three modes (metrics, logging, tracing).
| Argument | Default | Description |
|---|---|---|
| | off | Master switch: disable the EventBus entirely. |
| | off | Skip metrics subscribers (no Prometheus endpoint). |
| | off | Skip logging subscribers. |
| | off | Register tracing subscribers. |
| | | Max events in the EventBus queue before tail-drop. |
| | (none) | OTLP gRPC endpoint for exporting metrics and traces. |
| `--prometheus-port` | | Port for the Prometheus metrics endpoint. |
## vLLM Client Configuration
On the vLLM side, specify the LMCache server host and port via the
`kv_connector_extra_config` parameter:

```shell
vllm serve Qwen/Qwen3-14B \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"lmcache.mp.host": "127.0.0.1", "lmcache.mp.port": 6000}}'
```
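Because the `--kv-transfer-config` value is JSON nested one level deep, hand-writing it inside shell quotes is error-prone. One way to avoid quoting mistakes is to build the string with `json.dumps` and pass it along programmatically; this is an optional convenience sketch, reusing the host and port from the example above:

```python
import json

# Extra config consumed by the LMCache MP connector (values from the example).
extra = {"lmcache.mp.host": "127.0.0.1", "lmcache.mp.port": 6000}

kv_transfer_config = json.dumps({
    "kv_connector": "LMCacheMPConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": extra,
})

cmd = ["vllm", "serve", "Qwen/Qwen3-14B",
       "--kv-transfer-config", kv_transfer_config]

# The string round-trips cleanly, so vLLM sees exactly the intended values.
parsed = json.loads(cmd[-1])
assert parsed["kv_connector_extra_config"]["lmcache.mp.port"] == 6000
```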
## Environment Variables
| Variable | Description |
|---|---|
| `LMCACHE_LOG_LEVEL` | Log level for LMCache. |
| `PYTHONHASHSEED` | Set to a fixed value for reproducible hashing across processes. |
## Full Example
```shell
lmcache server \
  --host 0.0.0.0 \
  --port 6555 \
  --chunk-size 512 \
  --max-workers 4 \
  --max-gpu-workers 2 \
  --hash-algorithm blake3 \
  --engine-type default \
  --l1-size-gb 100 \
  --l1-use-lazy \
  --l1-init-size-gb 20 \
  --l1-align-bytes 4096 \
  --l1-write-ttl-seconds 600 \
  --l1-read-ttl-seconds 300 \
  --eviction-policy noop \
  --l2-store-policy skip_l1 \
  --eviction-trigger-watermark 0.9 \
  --eviction-ratio 0.1 \
  --l2-prefetch-policy default \
  --l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"}, "pool_size": 64}' \
  --prometheus-port 9090 \
  --prometheus-log-interval 10 \
  --enable-telemetry \
  --telemetry-processor '{"type": "logging", "log_level": "DEBUG"}'
```