Moving KV Cache

Warning

This page documents the behavior of LMCache's in-process mode (deprecated). Consider using LMCache MP mode for better feature support and performance.

The move interface is defined as follows:

move(old_position: Tuple[str, str], new_position: Tuple[str, str],
     tokens: Optional[List[int]] = [], copy: Optional[bool] = False) -> event_id: str, num_tokens: int

This function moves the KV Cache chunks identified by tokens from old_position to new_position. Each position is an (instance_id, location) tuple. Setting copy to True copies the KV Cache instead of moving it.

Note that NIXL must be installed for P2P transfer. Support for other transports, such as Python sockets and Mooncake, is planned.
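To make the argument shapes concrete, here is a minimal Python sketch that assembles the arguments of move into a JSON-serializable dictionary. The helper name build_move_payload is hypothetical, and the "copy" key is an assumption that the request field mirrors the copy parameter of the signature; neither comes from LMCache itself.

```python
from typing import Any, Dict, List, Tuple

# A position is an (instance_id, location) pair, as described above.
Position = Tuple[str, str]


def build_move_payload(
    old_position: Position,
    new_position: Position,
    tokens: List[int],
    copy: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: pack move() arguments into a plain dict."""
    for name, pos in (("old_position", old_position), ("new_position", new_position)):
        # Reject anything that is not a pair of strings.
        if len(pos) != 2 or not all(isinstance(part, str) for part in pos):
            raise ValueError(f"{name} must be an (instance_id, location) pair of strings")
    return {
        "old_position": list(old_position),
        "new_position": list(new_position),
        "tokens": list(tokens),
        # Assumption: the field name mirrors the `copy` parameter.
        "copy": copy,
    }


payload = build_move_payload(
    ("lmcache_instance_1", "LocalCPUBackend"),
    ("lmcache_instance_2", "LocalCPUBackend"),
    tokens=[128000, 849, 21435],
    copy=True,  # duplicate the KV Cache instead of moving it
)
```

Validating the positions before sending keeps malformed tuples from reaching the controller.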

Example usage:

First, prepare two yaml files, instance1.yaml and instance2.yaml, to configure the two lmcache instances:

# instance1.yaml
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5

# cache controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_1"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8500

# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8200
p2p_lookup_ports: 8201
transfer_channel: "nixl"

# instance2.yaml
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5

# cache controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_2"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8501

# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8202
p2p_lookup_ports: 8203
transfer_channel: "nixl"
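The move request later on this page addresses the second instance as lmcache_instance_2, so each config needs its own lmcache_instance_id. Both instances also run on localhost, so their ports must not overlap. The sketch below checks both conditions on plain dictionaries. The helper find_conflicts is hypothetical, and it assumes each listed key names a port that the instance binds locally.

```python
from typing import Dict, List

# Keys assumed to name locally bound ports (taken from the configs above).
PORT_KEYS = ("lmcache_worker_ports", "p2p_init_ports", "p2p_lookup_ports")


def find_conflicts(a: Dict[str, object], b: Dict[str, object]) -> List[str]:
    """Hypothetical helper: list settings that two instance configs must not share."""
    problems = []
    if a["lmcache_instance_id"] == b["lmcache_instance_id"]:
        problems.append(f"duplicate lmcache_instance_id: {a['lmcache_instance_id']}")
    # Ports bound by both instances on the same host would collide.
    shared = sorted({a[k] for k in PORT_KEYS} & {b[k] for k in PORT_KEYS})
    if shared:
        problems.append(f"ports used by both instances: {shared}")
    return problems


instance1 = {
    "lmcache_instance_id": "lmcache_instance_1",
    "lmcache_worker_ports": 8500,
    "p2p_init_ports": 8200,
    "p2p_lookup_ports": 8201,
}
instance2 = {
    "lmcache_instance_id": "lmcache_instance_2",
    "lmcache_worker_ports": 8501,
    "p2p_init_ports": 8202,
    "p2p_lookup_ports": 8203,
}

conflicts = find_conflicts(instance1, instance2)  # empty list when the configs are distinct
```

The controller URLs (controller_pull_url and controller_reply_url) are left out of the check on purpose: both instances point at the same controller.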

Start two vllm engines:

PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=instance1.yaml vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096 \
  --gpu-memory-utilization 0.8 --port 8000 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=instance2.yaml vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096 \
  --gpu-memory-utilization 0.8 --port 8001 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

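Start the lmcache controller on port 9000. The monitor ports are set to 8300 (pull) and 8400 (reply), matching controller_pull_url and controller_reply_url in the instance configs: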

PYTHONHASHSEED=123 lmcache_controller --host localhost --port 9000 --monitor-ports '{"pull": 8300, "reply": 8400}'

Send a request to vllm engine 1:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the significance of KV cache in language models.",
        "max_tokens": 10
      }'

Tokenize the prompt to get the token IDs:

curl -X POST http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the significance of KV cache in language models."
      }'
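The token IDs have to be pulled out of the tokenize reply before they can be passed to move. The sketch below assumes the reply is a JSON object with a "tokens" array; that field name is an assumption about the vLLM response format and should be checked against your vLLM version. The sample body is illustrative, built from the token IDs used in the move request on this page, not captured output.

```python
import json
from typing import List


def extract_token_ids(response_body: str) -> List[int]:
    """Hypothetical helper: read the token IDs out of a /tokenize reply."""
    data = json.loads(response_body)
    tokens = data["tokens"]  # assumption: IDs are returned under "tokens"
    if not all(isinstance(token, int) for token in tokens):
        raise ValueError("expected a list of integer token IDs")
    return tokens


# Illustrative reply body, not captured from a real server.
sample_body = (
    '{"count": 12, "max_model_len": 4096, '
    '"tokens": [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]}'
)

token_ids = extract_token_ids(sample_body)
```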

Move the KV Cache from engine 1's CPU to engine 2's CPU, passing the token IDs obtained above:

curl -X POST http://localhost:9000/move \
  -H "Content-Type: application/json" \
  -d '{
        "old_position": ["lmcache_instance_1", "LocalCPUBackend"],
        "new_position": ["lmcache_instance_2", "LocalCPUBackend"],
        "tokens": [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
      }'

The controller replies with a message similar to:

{"num_tokens": 12, "event_id": "xxx"}

num_tokens is the number of tokens whose KV Cache is being moved. The returned event_id can be used to query the status of the operation.
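The same move call can be issued from Python with only the standard library. This is a sketch under the assumptions of the curl example above (controller at localhost:9000, reply containing event_id and num_tokens). The helper names make_move_request and parse_move_reply are hypothetical. The request is built but not sent, so the snippet runs without a live controller.

```python
import json
import urllib.request
from typing import List, Tuple

CONTROLLER_URL = "http://localhost:9000"  # controller address from the example above


def make_move_request(tokens: List[int], controller_url: str = CONTROLLER_URL) -> urllib.request.Request:
    """Hypothetical helper: build the POST /move request shown in the curl example."""
    body = {
        "old_position": ["lmcache_instance_1", "LocalCPUBackend"],
        "new_position": ["lmcache_instance_2", "LocalCPUBackend"],
        "tokens": tokens,
    }
    return urllib.request.Request(
        f"{controller_url}/move",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_move_reply(raw: bytes) -> Tuple[str, int]:
    """Hypothetical helper: return (event_id, num_tokens) from the controller's reply."""
    reply = json.loads(raw)
    return reply["event_id"], reply["num_tokens"]


request = make_move_request(
    [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
)

# With the controller running, the request could be sent like this:
#   with urllib.request.urlopen(request) as resp:
#       event_id, num_tokens = parse_move_reply(resp.read())

# Parsing the sample reply shown above:
event_id, num_tokens = parse_move_reply(b'{"num_tokens": 12, "event_id": "xxx"}')
```

Keeping request construction separate from sending makes the payload easy to inspect before anything touches the cache.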