Quickstart

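This guide gets LMCache running end to end in a few minutes. Pick the section below that matches your engine. The steps are the same across engines; only the libraries and launch commands differ.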

vLLM

Install LMCache

uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm

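LMCache supports two deployment modes with vLLM:

MP mode (multiprocess) -- recommended. LMCache runs as a standalone service and vLLM connects through LMCacheMPConnector. This mode scales better, exposes management and observability endpoints, and lets multiple engine instances share one cache.
In-process mode -- LMCache runs inside the vLLM process through LMCacheConnectorV1. It needs a single command, which is convenient for quick single-node experiments.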

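MP mode (recommended)

Start the LMCache server. --host / --port set the ZMQ address that vLLM connects to; they are spelled out here so that the two commands line up (these values are also the defaults):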

# chunk-size 16 is an illustrative demo value so a short
# prompt produces visible cache traffic; use the default
# (256) in production.
lmcache server \
    --host localhost --port 5555 \
    --l1-size-gb 20 --eviction-policy LRU --chunk-size 16

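The ZMQ port (--port, default 5555) accepts connections from vLLM; the HTTP frontend (default 8080) serves the management and metrics endpoints. For the full list of lmcache server and connector options, see the configuration reference.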
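Before starting vLLM, it can help to confirm that both ports are actually listening. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `check_lmcache_ports` are hypothetical and not part of LMCache. A successful TCP connect only shows that something is listening on the port, not that it is an LMCache server.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


def check_lmcache_ports(host: str = "localhost",
                        zmq_port: int = 5555,
                        http_port: int = 8080) -> dict:
    """Probe the ZMQ port and the HTTP frontend port described above.

    The default port numbers are the ones quoted in this guide.
    """
    return {
        "zmq": port_is_open(host, zmq_port),
        "http": port_is_open(host, http_port),
    }
```

Calling `check_lmcache_ports()` after `lmcache server` has started should report both entries as `True`; if either is `False`, check the --host / --port values before launching vLLM.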

In a separate terminal, start vLLM with the MP connector. Point the connector at the server above with lmcache.mp.host / lmcache.mp.port in kv_connector_extra_config. The host must carry a ZMQ transport prefix such as tcp://:

vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"lmcache.mp.host": "tcp://localhost", "lmcache.mp.port": 5555}}'

Note

Where does LMCacheMPConnector resolve to? It depends on your vLLM version:

vLLM < 0.20.0 -- "kv_connector":"LMCacheMPConnector" always resolves to vLLM's built-in vllm.distributed.kv_transfer.kv_connector.v1.LMCacheMPConnector; it cannot be redirected to the implementation shipped with LMCache.
vLLM >= 0.20.0 -- "kv_connector":"LMCacheMPConnector" still defaults to vLLM's built-in connector, but you can opt in to the implementation shipped with LMCache (lmcache.integration.vllm.lmcache_mp_connector) by adding kv_connector_module_path:
```
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector", "kv_role":"kv_both"}'
```
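The connector shipped with LMCache tracks the latest LMCache server protocol and releases fixes and features before they land in the version bundled with vLLM, so prefer it when running vLLM 0.20.0 or later.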
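Hand-writing the JSON passed to --kv-transfer-config is easy to get wrong because of shell quoting. The sketch below assembles the same keys shown in the commands above and prints a shell-safe command line. The function names are hypothetical helpers, and combining kv_connector_module_path with kv_connector_extra_config in one config is an assumption, since the commands on this page show them separately.

```python
import json
import shlex


def build_kv_transfer_config(host="tcp://localhost", port=5555, module_path=None):
    """Build the dict passed (as JSON) to vLLM's --kv-transfer-config."""
    # The guide requires a ZMQ transport prefix such as tcp:// on the host.
    if "://" not in host:
        raise ValueError("host must include a transport prefix, e.g. tcp://localhost")
    config = {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "lmcache.mp.host": host,
            "lmcache.mp.port": port,
        },
    }
    # Optional: opt in to the connector shipped with LMCache (vLLM >= 0.20.0).
    if module_path is not None:
        config["kv_connector_module_path"] = module_path
    return config


def build_vllm_command(model, config, port=8000):
    """Return a shell-quoted `vllm serve` command line for the given config."""
    argv = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(config),
    ]
    return shlex.join(argv)
```

For example, `print(build_vllm_command("Qwen/Qwen3-8B", build_kv_transfer_config(module_path="lmcache.integration.vllm.lmcache_mp_connector")))` produces a command that can be pasted into a terminal.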

Test -- open a new terminal and send two requests that share a prefix:

First request

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Second request

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
    "max_tokens": 100,
    "temperature": 0.7
  }'
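The same two requests can also be sent from Python. This is a minimal sketch using only the standard library; `build_request` and `send` are hypothetical helper names, and the URL, model, and sampling parameters are copied from the curl commands above.

```python
import json
import urllib.request

FIRST_PROMPT = (
    "Qwen3 is the latest generation of large language models in Qwen series, "
    "offering a comprehensive suite of dense and mixture-of-experts"
)
# The second prompt extends the first, so the two requests share a prefix.
SECOND_PROMPT = FIRST_PROMPT + " (MoE) models"


def build_request(prompt,
                  url="http://localhost:8000/v1/completions",
                  model="Qwen/Qwen3-8B"):
    """Build the POST request that the curl commands above send."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

With the server running, call `send(build_request(FIRST_PROMPT))` followed by `send(build_request(SECOND_PROMPT))` and watch the LMCache logs while the second call runs.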

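You should see LMCache logs similar to the following. In MP mode the store/retrieve logs come from the standalone lmcache server process, with one entry per chunk.

First request -- the cache is empty, so every aligned chunk is offloaded: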

[2026-04-22 19:49:56,316] LMCache INFO: Stored 16 tokens in 0.023 seconds (server.py:390:lmcache.v1.multiprocess.server)
[2026-04-22 19:49:56,555] LMCache INFO: Stored 16 tokens in 0.005 seconds (server.py:390:lmcache.v1.multiprocess.server)
[2026-04-22 19:49:56,691] LMCache INFO: Stored 16 tokens in 0.005 seconds (server.py:390:lmcache.v1.multiprocess.server)
...

Second request -- the shared prefix is retrieved from CPU memory; only the new tail is stored:

[2026-04-22 19:50:04,686] LMCache INFO: Retrieved 16 tokens in 0.003 seconds (server.py:573:lmcache.v1.multiprocess.server)
[2026-04-22 19:50:04,832] LMCache INFO: Stored 16 tokens in 0.005 seconds (server.py:390:lmcache.v1.multiprocess.server)
[2026-04-22 19:50:04,968] LMCache INFO: Stored 16 tokens in 0.005 seconds (server.py:390:lmcache.v1.multiprocess.server)
...

For request-level statistics (hit rate, bytes transferred), see Observability.

In-process mode

Start vLLM with LMCache embedded in the engine process:

# The chunk size here is for illustration only; use the default (256) afterwards
LMCACHE_CHUNK_SIZE=8 \
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

Note

To customize further, create a configuration file. For all options, see Configuring LMCache.

A simpler alternative command:

vllm serve <MODEL NAME> \
    --kv-offloading-backend lmcache \
    --kv-offloading-size <SIZE IN GB> \
    --disable-hybrid-kv-cache-manager

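The --disable-hybrid-kv-cache-manager flag is required. All configuration options from the Configuring LMCache page still apply.

Test -- open a new terminal and send two requests that share a prefix:

First request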

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Second request

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
    "max_tokens": 100,
    "temperature": 0.7
  }'

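You should see LMCache logs similar to the following. In-process mode prints them inline with the vLLM engine core output.

First request -- the prompt is offloaded to LMCache: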

(EngineCore_DP0 pid=458469) [2025-09-30 00:08:43,982] LMCache INFO: Stored 31 out of total 31 tokens. size: 0.0040 gb, cost 1.95 ms, throughput: 1.98 GB/s; offload_time: 1.88 ms, put_time: 0.07 ms

Second request -- the cache is hit and the new tail is stored:

Reqid: cmpl-6709d8795d3c4464b01999c9f3fffede-0, Total tokens 32, LMCache hit tokens: 24, need to load: 8
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,502] LMCache INFO: Retrieved 8 out of 24 required tokens (from 32 total tokens). size: 0.0011 gb, cost 0.55 ms, throughput: 1.98 GB/s;
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,509] LMCache INFO: Storing KV cache for 8 out of 32 tokens (skip_leading_tokens=24)
(EngineCore_DP0 pid=494270) [2025-09-30 01:12:36,510] LMCache INFO: Stored 8 out of total 8 tokens. size: 0.0011 gb, cost 0.43 ms, throughput: 2.57 GB/s; offload_time: 0.40 ms, put_time: 0.03 ms

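Total tokens 32: the new prompt has 32 tokens after tokenization.
LMCache hit tokens: 24: 24 tokens (full 8-token chunks) were found from the first request, which stored 31 tokens.
need to load: 8: vLLM automatic prefix caching uses a block size of 16; 16 tokens are already in GPU memory, so LMCache loads only 24-16=8.
Why 24 hit tokens rather than 31? LMCache hashes every 8 tokens (8, 16, 24, 31). It matches page-aligned chunks, so it uses the 24-token hash.
Stored another 8 tokens: these 8 new tokens form a full chunk and are stored for future reuse.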
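The numbers in these log lines follow from rounding down to block boundaries. The sketch below reproduces the arithmetic only; it is not LMCache code. The function names are hypothetical, and it assumes the second prompt shares at least the first 24 tokens with the first prompt.

```python
def aligned_prefix(num_tokens: int, block: int) -> int:
    """Largest multiple of `block` that does not exceed `num_tokens`."""
    return (num_tokens // block) * block


def explain_second_request(first_tokens: int, second_tokens: int,
                           chunk: int = 8, gpu_block: int = 16) -> dict:
    """Recompute the log fields for the second request.

    chunk      -- LMCache chunk size (LMCACHE_CHUNK_SIZE=8 in this demo)
    gpu_block  -- vLLM prefix-caching block size quoted in the explanation
    """
    # LMCache can only reuse whole chunks stored by the first request.
    hit = aligned_prefix(first_tokens, chunk)
    # vLLM already holds whole GPU blocks of the first request's prompt.
    in_gpu = aligned_prefix(first_tokens, gpu_block)
    # Only the part of the hit that is not already in GPU memory is loaded.
    need_to_load = max(hit - in_gpu, 0)
    # New full chunks beyond the hit are stored for later reuse.
    stored = aligned_prefix(second_tokens, chunk) - hit
    return {
        "hit": hit,
        "in_gpu": in_gpu,
        "need_to_load": need_to_load,
        "stored": stored,
    }


print(explain_second_request(first_tokens=31, second_tokens=32))
# {'hit': 24, 'in_gpu': 16, 'need_to_load': 8, 'stored': 8}
```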

SGLang

Note

The SGLang integration now uses MP (multiprocess) mode by default. See examples/sgl_integration/README.md for the current setup instructions.

Install SGLang

uv venv --python 3.12
source .venv/bin/activate
uv pip install --prerelease=allow lmcache "sglang"

Launch SGLang with LMCache

cat > lmc_config.yaml <<'EOF'
chunk_size: 8  # demo only; use 256 for production
local_cpu: true
use_layerwise: true
max_local_cpu_size: 10  # GB
EOF

export LMCACHE_CONFIG_FILE=$PWD/lmc_config.yaml

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --host 0.0.0.0 \
  --port 30000 \
  --enable-lmcache

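Note

Configure LMCache through the configuration file. For the full list of options, see Configuring LMCache.

Test -- open a new terminal and send two requests that share a prefix:

First request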

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Second request

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'

You should see LMCache logs similar to the following:

First request -- the prompt and the generated tokens are stored:

Prefill batch, #new-seq: 1, #new-token: 35, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
Decode batch, #running-req: 1, #token: 74, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.63, #queue-req: 0,
Decode batch, #running-req: 1, #token: 114, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.95, #queue-req: 0,
LMCache INFO: Stored 128 out of total 135 tokens. size: 0.0195 GB, cost 12.8890 ms, throughput: 1.5153 GB/s (cache_engine.py:623:lmcache.v1.cache_engine)

Second request -- the Radix Cache and LMCache share the prefix; only the new part is stored:

Prefill batch, #new-seq: 1, #new-token: 10, #cached-token: 30, token usage: 0.00, #running-req: 0, #queue-req: 0,
Decode batch, #running-req: 1, #token: 64, token usage: 0.00, cuda graph: True, gen throughput (token/s): 8.29, #queue-req: 0,
Decode batch, #running-req: 1, #token: 104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.95, #queue-req: 0,
Decode batch, #running-req: 1, #token: 144, token usage: 0.00, cuda graph: True, gen throughput (token/s): 87.89, #queue-req: 0,
LMCache INFO: Stored 112 out of total 140 tokens. size: 0.0171 GB, cost 11.1986 ms, throughput: 1.5261 GB/s (cache_engine.py:623:lmcache.v1.cache_engine)

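Total tokens 140: SGLang stores KV cache for both prefill and decode tokens, so the total is 40 prompt + 100 generated = 140 tokens.
Cached tokens: 30: SGLang's Radix Attention Cache reuses 30 tokens from the first request.
LMCache hit tokens: 24: LMCache detects 24 tokens stored by the first request (3 full 8-token chunks). Because the Radix Cache already holds 30 tokens in GPU memory, these 24 tokens are neither loaded from LMCache nor stored again.
New tokens: 10: only 10 prompt tokens need prefill computation (40 prompt - 30 cached = 10).
Stored 112 out of 140: 24 tokens (3 full chunks) are already in LMCache and are skipped. Of the remaining 116 tokens, 112 (14 full 8-token chunks) are stored.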
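The stored counts in both SGLang log excerpts can be checked with the same round-down rule. This is a sketch of the arithmetic only, under the assumption that LMCache stores whole 8-token chunks and skips tokens it already holds; `stored_tokens` is a hypothetical helper, not an LMCache function.

```python
CHUNK_SIZE = 8  # chunk_size from lmc_config.yaml in this demo


def stored_tokens(total: int, already_in_lmcache: int,
                  chunk: int = CHUNK_SIZE) -> int:
    """Tokens written to LMCache: whole chunks of whatever is not yet cached."""
    remaining = total - already_in_lmcache
    return (remaining // chunk) * chunk


# First request: 35 prompt + 100 generated tokens, nothing cached yet.
first_total = 35 + 100
first_stored = stored_tokens(first_total, already_in_lmcache=0)

# Second request: 40 prompt + 100 generated tokens, 24 already in LMCache.
second_total = 40 + 100
second_stored = stored_tokens(second_total, already_in_lmcache=24)

# Prefill work on the second request: prompt tokens minus Radix Cache hits.
new_prefill_tokens = 40 - 30

print(first_stored, second_stored, new_prefill_tokens)
# 128 112 10
```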

TensorRT-LLM

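Note

This integration depends on the connector preset registry in NVIDIA/TensorRT-LLM PR #12626 and on the matching LMCache adapter, neither of which has shipped in a stable release yet. Until they do, install both from source: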

uv venv --python 3.12
source .venv/bin/activate

# LMCache from source (dev branch)
uv pip install git+https://github.com/LMCache/LMCache.git@dev

# TensorRT-LLM from source — see NVIDIA's build guide:
# https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html

Once both ship in stable releases, the install command will be:

uv pip install lmcache "tensorrt_llm>=<version>" \
    --extra-index-url https://pypi.nvidia.com

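LMCache integrates with TensorRT-LLM through TRT-LLM's KV Cache Connector API and supports two deployment modes:

In-process mode (connector: lmcache) -- LMCache runs as a singleton inside the TRT-LLM process. This is the simplest setup; there is no extra service to manage.
MP mode (connector: lmcache-mp) -- LMCache runs as a standalone server. Multiple TRT-LLM workers can share the cache, and the cache survives a TRT-LLM crash.

In-process mode

Configure LMCache through environment variables: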

export PYTHONHASHSEED=0  # required — chunk hashing depends on stable hash()
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=2.0  # GiB

Build the TRT-LLM LLM with connector: lmcache:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi.llm_args import (
    KvCacheConfig, KvCacheConnectorConfig,
)

llm = LLM(
    model="Qwen/Qwen2-1.5B-Instruct",
    backend="pytorch",
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
    kv_connector_config=KvCacheConnectorConfig(connector="lmcache"),
)

out = llm.generate(["Your prompt here"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

MP mode

PYTHONHASHSEED=0 must be set in both terminals -- chunk hashing depends on a stable hash(), so the server and the client must use the same seed value.
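The reason the variable has to be exported in the shell, rather than set from inside a script, is that Python reads PYTHONHASHSEED once at interpreter startup. The sketch below shows the effect on Python's built-in hash() by spawning fresh interpreters; it does not call LMCache, and `hash_in_fresh_interpreter` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new Python process with the given seed."""
    env = {**os.environ, "PYTHONHASHSEED": seed}
    result = subprocess.run(
        # With -c, sys.argv[0] is "-c" and sys.argv[1] is the extra argument.
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate processes with the same seed agree on the hash value,
# which is what the server and the client need.
server_side = hash_in_fresh_interpreter("example-chunk", seed="0")
client_side = hash_in_fresh_interpreter("example-chunk", seed="0")
print(server_side == client_side)
```

Without a fixed seed, string hashing is randomized per process, so two processes would generally disagree on the value.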

Start the LMCache server:

export PYTHONHASHSEED=0
lmcache server \
    --l1-size-gb 10 --eviction-policy LRU --chunk-size 256

In another terminal, point TRT-LLM at the server with server_url:

export PYTHONHASHSEED=0
python run_trtllm.py

where run_trtllm.py contains:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi.llm_args import (
    KvCacheConfig, KvCacheConnectorConfig,
)

llm = LLM(
    model="Qwen/Qwen2-1.5B-Instruct",
    backend="pytorch",
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
    kv_connector_config=KvCacheConnectorConfig(
        connector="lmcache-mp",
        server_url="tcp://localhost:5555",
    ),
)

out = llm.generate(["Your prompt here"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

Note

The TRT-LLM adapter reads LMCacheEngineConfig the same way the vLLM adapter does: LMCACHE_CONFIG_FILE for a YAML file, otherwise the individual LMCACHE_* environment variables. For all options, see Configuring LMCache.
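To see which of the two sources a given shell environment would select, the precedence described above can be sketched as follows. This is an illustration only and not LMCache's actual loader: `describe_config_source` is a hypothetical helper, and the mapping from LMCACHE_CHUNK_SIZE to chunk_size is an assumption based on the variable and YAML key names that appear on this page.

```python
from typing import Mapping

PREFIX = "LMCACHE_"


def describe_config_source(env: Mapping[str, str]) -> dict:
    """Report whether a YAML file or individual variables would be used."""
    config_file = env.get("LMCACHE_CONFIG_FILE")
    if config_file:
        # A config file is set, so it is the source of the settings.
        return {"source": "file", "path": config_file}
    # Otherwise collect the individual LMCACHE_* variables.
    # Assumed naming pattern: LMCACHE_CHUNK_SIZE -> chunk_size.
    values = {
        name[len(PREFIX):].lower(): value
        for name, value in env.items()
        if name.startswith(PREFIX)
    }
    return {"source": "env", "values": values}
```

Passing `os.environ` to the function shows what the current shell would provide, which is a quick way to spot a stale LMCACHE_CONFIG_FILE export overriding the variables you meant to use.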

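🎉 You can now cache and reuse KV caches with LMCache on all three engines.

Next steps

Performance testing: try the benchmarking section to see LMCache's performance gains and more comprehensive examples.
Production deployment: deploy LMCache with Docker or Kubernetes, with observability and tuning enabled -- see the deployment guide.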
