Quickstart#
This guide gets LMCache running end-to-end in a couple of minutes. Use the tabs below to switch between engines; the steps are the same, only the libraries and launch commands change.
(Terminal 1) Install LMCache and vLLM
uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm
Start vLLM with LMCache:
# The chunk size here is for illustration only; use the default (256) in production
LMCACHE_CHUNK_SIZE=8 \
vllm serve Qwen/Qwen3-8B \
--port 8000 --kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Note
To customize further, create a config file. See Configuring LMCache for all options.
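For example, you can point vLLM at a config file through the LMCACHE_CONFIG_FILE environment variable, the same mechanism the SGLang tab uses below. The keys in this sketch mirror that example and are illustrative only; check Configuring LMCache before relying on them:
cat > lmc_config.yaml <<'EOF'
chunk_size: 256        # default chunk size
local_cpu: true        # offload KV cache to CPU memory
max_local_cpu_size: 10 # GB of CPU memory to use
EOF
export LMCACHE_CONFIG_FILE=$PWD/lmc_config.yaml
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'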
A simpler alternative launch command:
vllm serve <MODEL NAME> \
--kv-offloading-backend lmcache \
--kv-offloading-size <SIZE IN GB> \
--disable-hybrid-kv-cache-manager
The --disable-hybrid-kv-cache-manager flag is mandatory. All configuration options from the Configuring LMCache page still apply.
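For instance, with the same model as above and a hypothetical offloading budget of 20 GB (both values are placeholders to adjust for your setup):
vllm serve Qwen/Qwen3-8B \
    --kv-offloading-backend lmcache \
    --kv-offloading-size 20 \
    --disable-hybrid-kv-cache-manager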
(Terminal 1) Install LMCache and SGLang
uv venv --python 3.12
source .venv/bin/activate
uv pip install --prerelease=allow lmcache "sglang"
Start SGLang with LMCache:
cat > lmc_config.yaml <<'EOF'
chunk_size: 8 # demo only; use 256 for production
local_cpu: true
use_layerwise: true
max_local_cpu_size: 10 # GB
EOF
export LMCACHE_CONFIG_FILE=$PWD/lmc_config.yaml
python -m sglang.launch_server \
--model-path Qwen/Qwen3-14B \
--host 0.0.0.0 \
--port 30000 \
--enable-lmcache
Note
Configure LMCache via the config file. See Configuring LMCache for the full list.
(Terminal 2) Test LMCache in Action#
Open a new terminal, pick the tab for your engine, send the first request, and then send a second request whose prompt overlaps the first:
First request (vLLM)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B-Instruct",
"prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
"max_tokens": 100,
"temperature": 0.7
}'
Second request (vLLM, overlap)
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B-Instruct",
"prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
"max_tokens": 100,
"temperature": 0.7
}'
First request (SGLang)
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-14B",
"messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"}],
"max_tokens": 100,
"temperature": 0.7
}'
Second request (SGLang, overlap)
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-14B",
"messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models"}],
"max_tokens": 100,
"temperature": 0.7
}'
You should see LMCache logs like the following (one example per engine).
vLLM:
(EngineCore_DP0 pid=458469) [2025-09-30 00:08:43,982] LMCache INFO: Stored 31 out of total 31 tokens. size: 0.0040 gb, cost 1.95 ms, throughput: 1.98 GB/s; offload_time: 1.88 ms, put_time: 0.07 ms
SGLang:
Prefill batch, #new-seq: 1, #new-token: 35, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
Decode batch, #running-req: 1, #token: 74, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.63, #queue-req: 0,
Decode batch, #running-req: 1, #token: 114, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.95, #queue-req: 0,
LMCache INFO: Stored 128 out of total 135 tokens. size: 0.0195 GB, cost 12.8890 ms, throughput: 1.5153 GB/s (cache_engine.py:623:lmcache.v1.cache_engine)
What this means: the first request caches the prompt; the second reuses the cached prefix and only loads the missing chunk. You should see logs like the following.
vLLM:
Reqid: cmpl-6709d8795d3c4464b01999c9f3fffede-0, Total tokens 32, LMCache hit tokens: 24, need to load: 8
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,502] LMCache INFO: Retrieved 8 out of 24 required tokens (from 32 total tokens). size: 0.0011 gb, cost 0.55 ms, throughput: 1.98 GB/s;
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,509] LMCache INFO: Storing KV cache for 8 out of 32 tokens (skip_leading_tokens=24)
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,510] LMCache INFO: Stored 8 out of total 8 tokens. size: 0.0011 gb, cost 0.43 ms, throughput: 2.57 GB/s; offload_time: 0.40 ms, put_time: 0.03 ms
SGLang:
Prefill batch, #new-seq: 1, #new-token: 10, #cached-token: 30, token usage: 0.00, #running-req: 0, #queue-req: 0,
Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 8.29, #queue-req: 0,
Decode batch, #running-req: 1, #token: 104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.95, #queue-req: 0,
Decode batch, #running-req: 1, #token: 144, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.89, #queue-req: 0,
LMCache INFO: Stored 112 out of total 140 tokens. size: 0.0171 GB, cost 11.1986 ms, throughput: 1.5261 GB/s (cache_engine.py:623:lmcache.v1.cache_engine)
What this means (per engine):
vLLM:
Total tokens 32: The new prompt has 32 tokens after tokenization.
LMCache hit tokens: 24: 24 tokens (three full 8-token chunks) were found in the cache populated by the first request, which stored 31 tokens.
Need to load: 8: vLLM's automatic prefix caching uses a block size of 16, so 16 tokens of the hit already sit in GPU memory; LMCache only needs to load 24 - 16 = 8 tokens.
Why 24 hit tokens instead of 31? LMCache hashes the prompt every 8 tokens (at 8, 16, 24, and the trailing 31) and matches only aligned chunks, so the 24-token hash is the longest reusable match.
Stored another 8 tokens: The new 8 tokens form a full chunk and are stored for future reuse.
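If it helps, the token accounting above can be checked with a quick back-of-the-envelope calculation; the values come from the log above, and the variable names are ours, not part of LMCache:
# Assumed inputs: a 31-token shared prefix from the first prompt,
# LMCache chunk_size=8, vLLM prefix-cache block size 16.
CHUNK=8; BLOCK=16; SHARED=31
HIT=$(( SHARED / CHUNK * CHUNK ))     # 24: longest chunk-aligned prefix LMCache can match
IN_GPU=$(( SHARED / BLOCK * BLOCK ))  # 16: longest block-aligned prefix vLLM already holds
echo "hit=$HIT load_from_lmcache=$(( HIT - IN_GPU ))"   # hit=24 load_from_lmcache=8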
SGLang:
Total tokens 140: SGLang stores KV cache for prefill and decode tokens together, so the total is 40 prompt + 100 generated = 140 tokens.
Cached tokens: 30: SGLang’s Radix Attention Cache reused 30 tokens from the first request.
LMCache hit tokens: 24: LMCache detected 24 tokens (3 full 8-token chunks) stored from the first request. Since Radix Cache already provides 30 tokens in GPU memory, these 24 tokens don’t need to be loaded from LMCache or stored again.
New tokens: 10: Only 10 prompt tokens need prefill computation (40 prompt - 30 cached = 10).
Stored 112 out of 140: 24 tokens (3 full chunks) are already in LMCache and skipped. Of the remaining 116 tokens, 112 (14 full 8-token chunks) are stored.
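The same arithmetic reproduces the SGLang numbers; again, the variable names below are only for illustration:
# Assumed inputs from the log above: 40 prompt + 100 generated tokens,
# chunk_size=8, and 24 tokens already chunk-aligned in LMCache from the first request.
CHUNK=8; PROMPT=40; GENERATED=100; ALREADY=24
TOTAL=$(( PROMPT + GENERATED ))           # 140 tokens of KV cache in total
REMAINING=$(( TOTAL - ALREADY ))          # 116 tokens not yet in LMCache
STORED=$(( REMAINING / CHUNK * CHUNK ))   # 112: only full 8-token chunks get stored
echo "total=$TOTAL stored=$STORED"        # total=140 stored=112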
🎉 LMCache is now storing and reusing KV caches in your chosen engine.
Next Steps#
Performance Testing: Try the Benchmarking section for more comprehensive examples that demonstrate LMCache's performance benefits
More Examples: Explore the More Examples section for detailed examples including KV cache sharing across instances and disaggregated prefill