Example: Disaggregated Prefill

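Warning

This page documents the behavior of LMCache's in-process mode, which is deprecated. Consider using LMCache MP mode for better feature support and performance. For the MP-mode equivalent of this page, see Disaggregated Prefill.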

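With LMCache as the KV cache transfer library, we can run disaggregated prefill with vLLM. LMCache currently uses NIXL as its transport layer, which enables fast KV cache transfer over NVLink, RDMA, or TCP.

This guide shows how to run LMCache with disaggregated prefill on a single machine using one prefiller and one decoder (1P1D). The architecture splits LLM inference into two stages, prefill and decode, which run on separate GPUs for better resource utilization.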

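Prerequisites

Before you begin, make sure you have:

At least 2 GPUs
Installed Python packages:
- lmcache (0.2.1 or later)
- nixl (see the NIXL installation instructions)
- vllm (latest main branch)
- httpx, fastapi, and uvicorn
A valid Hugging Face token (HF_TOKEN) with access to the Llama 3.1 8B model
(Recommended) A GPU machine with NVLink or RDMA enabled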
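
The package list above can be checked from Python before launching anything. This is a minimal sketch using only the standard library. It assumes each package is installed under the same distribution name as shown in the list, and it only checks presence, not the minimum lmcache version.

```python
from importlib import metadata

# Names copied from the prerequisites list; assumed to match the
# distribution names used by the installer.
REQUIRED = ["lmcache", "nixl", "vllm", "httpx", "fastapi", "uvicorn"]


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_packages(REQUIRED)` returns an empty list when every package is present, and otherwise the names that still need to be installed.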

Note

You can use ucx_perftest to check GPU-to-GPU memory transfers and verify NVLink or RDMA connectivity. See UCX Performance Test.

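Architecture Overview

The disaggregated prefill setup consists of three main components:

Prefill server (port 7100): handles the prefill stage of LLM inference
Decoder server (port 7200): handles the decode (generation) stage
Proxy server (port 9100): coordinates requests between the prefiller and the decoder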
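
To make the proxy's coordinating role concrete, here is a rough sketch of how a 1P1D proxy could order its two upstream calls. It is an assumption, not a description of the actual proxy script: it follows the common disaggregated-prefill pattern in which the prefill call is capped at one output token and the decoder then serves the unmodified request. `plan_request` is a hypothetical helper, and the URLs use the ports from the launch commands on this page.

```python
import copy

# Addresses taken from the vllm serve launch commands on this page.
PREFILLER_URL = "http://localhost:7100/v1/completions"
DECODER_URL = "http://localhost:7200/v1/completions"


def plan_request(client_payload):
    """Return the (url, payload) calls a 1P1D proxy would make, in order.

    Assumption: the prefill call is limited to one output token so the
    prefiller only computes and ships the KV cache; the decoder then
    receives the original request and generates the full response.
    """
    prefill_payload = copy.deepcopy(client_payload)
    prefill_payload["max_tokens"] = 1
    return [(PREFILLER_URL, prefill_payload), (DECODER_URL, client_payload)]
```

The client payload is deep-copied so that capping `max_tokens` for the prefill call never changes what the decoder receives.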

Configuration

Prefiller Server Configuration (lmcache-prefiller-config.yaml):

local_cpu: False

# PD-related configurations
enable_pd: True
transfer_channel: "nixl"  # Using NIXL for transfer
pd_role: "sender"          # Prefiller acts as KV cache sender
pd_proxy_host: "localhost" # Host where proxy server is running
pd_proxy_port: 7500        # Port where proxy server is listening
pd_buffer_size: 1073741824  # 1GB buffer for KV cache transfer
pd_buffer_device: "cuda"   # Use GPU memory for buffer

Decoder Server Configuration (lmcache-decoder-config.yaml):

local_cpu: False

# PD-related configurations
enable_pd: True
transfer_channel: "nixl" # Using NIXL for transfer
pd_role: "receiver"        # Decoder acts as KV cache receiver
pd_peer_host: "localhost"  # Host where decoder is listening
pd_peer_init_port: 7300    # Port where initialization happens
pd_peer_alloc_port: 7400   # Port for memory allocation
pd_buffer_size: 1073741824  # 1GB buffer for KV cache transfer
pd_buffer_device: "cuda"   # Use GPU memory for buffer

Step-by-Step Setup

Environment Setup

Before starting the vLLM servers, set your Hugging Face token.
```
export HF_TOKEN=your_hugging_face_token
```

Start the vLLM + LMCache Inference Servers

You can start each component separately:

Start the decoder (on GPU 1):

UCX_TLS=cuda_ipc,cuda_copy,tcp \
    LMCACHE_CONFIG_FILE=lmcache-decoder-config.yaml \
    CUDA_VISIBLE_DEVICES=1 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 7200 \
    --disable-log-requests \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1"}}'

Start the prefiller (on GPU 0):

UCX_TLS=cuda_ipc,cuda_copy,tcp \
    LMCACHE_CONFIG_FILE=lmcache-prefiller-config.yaml \
    CUDA_VISIBLE_DEVICES=0 \
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 7100 \
    --disable-log-requests \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'

Start a proxy server to coordinate the prefiller and the decoder:

The code for the proxy server is available in the vLLM repo.

python3 ../disagg_proxy_server.py \
  --host localhost \
  --port 9100 \
  --prefiller-host localhost \
  --prefiller-port 7100 \
  --num-prefillers 1 \
  --decoder-host localhost \
  --decoder-port 7200  \
  --decoder-init-port 7300 \
  --decoder-alloc-port 7400 \
  --proxy-host localhost \
  --proxy-port 7500 \
  --num-decoders 1
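
The ports in the two YAML files and in the proxy command have to line up. The sketch below cross-checks them using plain dictionaries that mirror the values shown on this page. The pairing of each config key with a proxy flag is inferred from the matching values (7500, 7300, 7400) rather than stated by the source, and `find_wiring_problems` is a hypothetical helper.

```python
# Values copied from lmcache-prefiller-config.yaml, lmcache-decoder-config.yaml,
# and the proxy command on this page.
prefiller_cfg = {"enable_pd": True, "pd_role": "sender", "pd_proxy_port": 7500}
decoder_cfg = {
    "enable_pd": True,
    "pd_role": "receiver",
    "pd_peer_init_port": 7300,
    "pd_peer_alloc_port": 7400,
}
proxy_flags = {"proxy_port": 7500, "decoder_init_port": 7300, "decoder_alloc_port": 7400}


def find_wiring_problems(prefiller, decoder, proxy):
    """Return human-readable mismatches; an empty list means the ports agree."""
    problems = []
    if prefiller.get("pd_role") != "sender":
        problems.append("prefiller pd_role should be 'sender'")
    if decoder.get("pd_role") != "receiver":
        problems.append("decoder pd_role should be 'receiver'")
    # (config key, config dict, proxy flag) pairs; pairing is an inference.
    pairs = [
        ("pd_proxy_port", prefiller, "proxy_port"),
        ("pd_peer_init_port", decoder, "decoder_init_port"),
        ("pd_peer_alloc_port", decoder, "decoder_alloc_port"),
    ]
    for key, cfg, flag in pairs:
        if cfg.get(key) != proxy.get(flag):
            cli_flag = "--" + flag.replace("_", "-")
            problems.append(f"{key}={cfg.get(key)} does not match {cli_flag}={proxy.get(flag)}")
    return problems
```

With the values from this page the function returns an empty list. Changing one side only, for example `--proxy-port`, produces a message naming both the config key and the flag.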

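Note

The UCX_TLS environment variable specifies which transports UCX may use; the value in this example enables NVLink transfers. The CUDA_VISIBLE_DEVICES environment variable specifies which GPU each server uses.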

Verify the Setup

The servers are ready when you can reach the following endpoints:
- Prefiller: http://localhost:7100/v1/completions
- Decoder: http://localhost:7200/v1/completions
- Proxy: http://localhost:9100/v1/completions
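
Model loading can take a while, so it may help to poll the three ports instead of retrying by hand. The following is a minimal sketch that treats an accepted TCP connection as a sign the server is up. That is a weaker check than a successful completion request, and the helper names and timeouts are illustrative.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s=600.0, poll_s=2.0):
    """Poll until every port accepts connections or the deadline passes.

    Returns the sorted list of ports that were still closed at the last
    check; an empty list means all servers are reachable.
    """
    end = time.monotonic() + deadline_s
    pending = set(ports)
    while True:
        pending = {p for p in pending if not port_open(host, p)}
        if not pending or time.monotonic() >= end:
            return sorted(pending)
        time.sleep(poll_s)
```

For the setup on this page, `wait_for_ports("localhost", [7100, 7200, 9100])` returns `[]` once the prefiller, decoder, and proxy are all listening.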

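Usage

Send requests to the proxy server (port 9100, as set by --port in the proxy launch command) through either the completions or the chat completions endpoint:

curl http://localhost:9100/v1/completions \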
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Tell me a story",
        "max_tokens": 100
    }'
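
The same request can be sent from Python. This sketch uses only the standard library and assumes the proxy returns the usual OpenAI-style completions response, where the generated text is at `choices[0].text`. The function names are illustrative.

```python
import json
import urllib.request

MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_completion_request(base_url, prompt, max_tokens=100):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, max_tokens=100, timeout=120):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the proxy started as shown in the launch command, `complete("http://localhost:9100", "Tell me a story")` returns the generated continuation.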

You can also test the setup with the following commands, which run vLLM's serving benchmark against the proxy:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
vllm bench serve --port 9100 --seed $(date +%s) \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 5000 --random-output-len 200 \
    --num-prompts 50 --burstiness 100 --request-rate 1

Monitoring

The prefiller instance logs the throughput of the KV cache transfer:

LMCache INFO: Stored 5271 tokens in 6.5000 ms, throughput: 98.9889 GB/s; offload time: 2.6594 ms, put time: 3.4539 ms (cache_engine.py:190:lmcache.v1.cache_engine)
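
The throughput figure in that log line can be sanity-checked from the model shape. The sketch below assumes the Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and a 2-byte KV cache dtype such as bf16. Under those assumptions each token occupies 128 KiB of KV cache, and the logged figure is reproduced when "GB" is read as 2^30 bytes.

```python
# Llama 3.1 8B shape; the 2-byte element size (bf16/fp16) is an assumption.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2
GIB = 1 << 30


def kv_bytes_per_token():
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT


def throughput_gib_per_s(num_tokens, seconds):
    """Transfer rate for num_tokens worth of KV cache moved in `seconds`."""
    return num_tokens * kv_bytes_per_token() / GIB / seconds


def tokens_that_fit(buffer_bytes):
    """How many tokens of KV cache a transfer buffer of this size can hold."""
    return buffer_bytes // kv_bytes_per_token()


# 5271 tokens in 6.5 ms, as in the log line above.
print(f"{throughput_gib_per_s(5271, 6.5e-3):.2f} GiB/s")
# pd_buffer_size from the config files on this page.
print(tokens_that_fit(1073741824), "tokens fit in the transfer buffer")
```

The small gap between the computed rate and the logged 98.9889 is consistent with the elapsed time being rounded to 6.5000 ms in the log. The same arithmetic shows that the 1073741824-byte pd_buffer_size corresponds to 8192 tokens of KV cache under these assumptions.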

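The decoder instance logs how many tokens were retrieved from LMCache:

LMCache INFO: Request ID: cmpl-b8bf01cbe47e4d108732ceeb4158d310-0, total tokens: 5170, LMCache hit tokens: 5169, need to load: 5169 (vllm_v1_adapter.py:543:lmcache.integration.vllm.vllm_v1_adapter)

The proxy server logs TTFT statistics for the prefill node: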

===============================
Num requests: 49
Prefill node TTFT stats:
- Average (ms): 0.1530598815606565
- Median (ms): 0.15739011764526367
- 99th Percentile (ms): 0.1643616008758545
===============================
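
If you collect per-request TTFT values yourself, the same three statistics can be computed as follows. This is a sketch rather than the proxy's own code: it uses the nearest-rank method for the 99th percentile, so results can differ slightly from an implementation that interpolates between samples.

```python
import statistics


def ttft_summary(samples):
    """Summarize TTFT samples; values stay in the unit they were given in."""
    if not samples:
        raise ValueError("need at least one TTFT sample")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), computed with
    # integer arithmetic to avoid floating-point rounding in the rank.
    rank = (99 * n + 99) // 100
    return {
        "count": n,
        "average": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }
```

For a run of 49 requests, as in the output above, the nearest-rank 99th percentile is simply the largest sample, because ceil(0.99 × 49) = 49.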

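Troubleshooting

Common issues and solutions:

GPU requirements: make sure at least 2 GPUs are available
Port conflicts: check that the ports used above are free
HF token: verify that your token starts with hf_ and has access to the required model
CUDA errors: make sure CUDA_VISIBLE_DEVICES is set correctly for each server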
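
Two of these checks are easy to script. The sketch below tests whether a port can still be bound and whether a token has the expected hf_ prefix. The prefix check says nothing about model access, and the port list is read off the commands on this page. Both helper names are illustrative.

```python
import socket

# Ports used by the launch commands and config files on this page.
PORTS = [7100, 7200, 7300, 7400, 7500, 9100]


def token_looks_valid(token):
    """Check only the hf_ prefix; access rights cannot be verified locally."""
    return bool(token) and token.startswith("hf_")


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound, meaning no other process holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False
```

Run `[p for p in PORTS if not port_is_free(p)]` before starting the servers to list conflicting ports, and `token_looks_valid(os.environ.get("HF_TOKEN"))` (after `import os`) to check the token that the servers will see.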