Move the KV cache#
The move interface is defined as follows:

move(old_position: Tuple[str, str], new_position: Tuple[str, str],
     tokens: Optional[List[int]] = [], copy: Optional[bool] = False) -> event_id: str, num_tokens: int

The function moves the KV cache chunks identified by tokens from old_position to new_position. Each position is a tuple of (instance_id, location). Setting copy to True copies the KV cache instead of moving it.
Note that NIXL must be installed for P2P transfer. Support for other transports, such as Python sockets and Mooncake, will be added later.
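For orientation before the full walkthrough, here is a minimal Python sketch of a thin client for the controller's /move route. The route, payload shape, and response fields mirror the curl example later on this page; sending copy in the JSON body is an assumption, since the example below omits it:

import requests
from typing import List, Optional, Tuple

def move(
    controller_url: str,
    old_position: Tuple[str, str],   # (instance_id, location)
    new_position: Tuple[str, str],
    tokens: Optional[List[int]] = None,
    copy: bool = False,              # assumed to be accepted in the JSON body
) -> Tuple[str, int]:
    # POST the move request to the controller and return (event_id, num_tokens).
    payload = {
        "old_position": list(old_position),
        "new_position": list(new_position),
        "tokens": tokens or [],
        "copy": copy,
    }
    result = requests.post(f"{controller_url}/move", json=payload).json()
    return result["event_id"], result["num_tokens"]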
Example usage:#
First, prepare two YAML files, instance1.yaml and instance2.yaml, to configure two LMCache instances:
# instance1.yaml
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5
# cache controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_1"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8500
# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8200
p2p_lookup_ports: 8201
transfer_channel: "nixl"
# instance2.yaml
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5
# cache controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_2"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8501
# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8202
p2p_lookup_ports: 8203
transfer_channel: "nixl"
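Because both instances share one host in this example, the per-instance ports (lmcache_worker_ports, p2p_init_ports, p2p_lookup_ports) must not collide, while the controller URLs are intentionally identical. A quick sanity check, sketched with PyYAML, could look like:

import yaml  # pip install pyyaml

with open("instance1.yaml") as f1, open("instance2.yaml") as f2:
    c1, c2 = yaml.safe_load(f1), yaml.safe_load(f2)

# The two instances must register under different ids.
assert c1["lmcache_instance_id"] != c2["lmcache_instance_id"]

# Per-instance ports must be unique on a shared host.
for key in ("lmcache_worker_ports", "p2p_init_ports", "p2p_lookup_ports"):
    assert c1[key] != c2[key], f"{key} clashes between the two configs"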
Start two vLLM engines:
PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=instance1.yaml vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096 \
--gpu-memory-utilization 0.8 --port 8000 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=instance2.yaml vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096 \
--gpu-memory-utilization 0.8 --port 8001 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
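Model loading can take a while; a small readiness poll (a sketch assuming the OpenAI-compatible /v1/models route that vllm serve exposes) avoids sending requests too early:

import time
import requests

def wait_ready(base_url: str, timeout_s: float = 300.0) -> None:
    # Poll the OpenAI-compatible model list until the server answers.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=2).ok:
                return
        except requests.ConnectionError:
            pass
        time.sleep(2)
    raise TimeoutError(f"{base_url} did not become ready")

for url in ("http://localhost:8000", "http://localhost:8001"):
    wait_ready(url)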
Start the lmcache controller at port 9000, with the monitor pulling on port 8300 and replying on port 8400 (matching controller_pull_url and controller_reply_url in the configs above):
PYTHONHASHSEED=123 lmcache_controller --host localhost --port 9000 --monitor-ports '{"pull": 8300, "reply": 8400}'
Send a request to vLLM engine 1:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
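The same request in Python, assuming the standard OpenAI-compatible response layout:

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the significance of KV cache in language models.",
        "max_tokens": 10,
    },
)
print(resp.json()["choices"][0]["text"])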
Tokenize the prompt to obtain token ids:
curl -X POST http://localhost:8000/tokenize \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models."
}'
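To feed the ids into the move request programmatically, extract them from the response (the tokens field name is an assumption about vLLM's tokenize response; check your vLLM version's API docs):

import requests

resp = requests.post(
    "http://localhost:8000/tokenize",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain the significance of KV cache in language models.",
    },
)
token_ids = resp.json()["tokens"]  # assumed field name in the tokenize response
print(token_ids)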
Move the KV cache from engine 1’s CPU to engine 2’s CPU using the token ids:
curl -X POST http://localhost:9000/move \
-H "Content-Type: application/json" \
-d '{
"old_position": ["lmcache_instance_1", "LocalCPUBackend"],
"new_position": ["lmcache_instance_2", "LocalCPUBackend"],
"tokens": [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
}'
The controller responds with a message similar to:
{"num_tokens": 12, "event_id": "xxx"}
num_tokens indicates how many tokens' KV cache are being moved; the returned event_id can be used to query the status of the operation.
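Equivalently, with the move helper sketched at the top of this page:

event_id, num_tokens = move(
    "http://localhost:9000",
    old_position=("lmcache_instance_1", "LocalCPUBackend"),
    new_position=("lmcache_instance_2", "LocalCPUBackend"),
    tokens=[128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13],
)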