Example: Share KV cache across multiple LLMs#
By sharing the KV cache across instances, LMCache should reduce the generation time of the second and any subsequent calls that reuse the same prompt.
We have examples for the following types of cross-instance KV cache sharing:
- KV cache sharing through a centralized cache server: centralized_sharing
- KV cache sharing through P2P cache transfer: p2p_sharing
Prerequisites#
Your server should have at least 2 GPUs.
Centralized sharing uses ports 8000 and 8001 (for the two vLLM instances) and port 65432 (for the LMCache server).
P2P sharing uses ports 8000 and 8001 for the two vLLM instances, ports 8200 and 8201 for the two distributed cache servers, and port 8100 for the lookup server.
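A quick way to verify these prerequisites is to list the visible GPUs and confirm that none of the required ports are already taken. This is a minimal sketch, assuming a Linux host with nvidia-smi and ss installed:
# Should list at least 2 GPUs
nvidia-smi --list-gpus
# No output means none of the required ports are currently in use
ss -ltn | grep -E ':(8000|8001|8100|8200|8201|65432)\s'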
Centralized KV cache sharing#
This section demonstrates how to share KV cache across multiple vLLM instances using a centralized LMCache server.
Setup centralized sharing#
First, create a configuration file named lmcache_config.yaml
with the following content:
chunk_size: 256
local_cpu: true
remote_url: "lm://localhost:65432"
remote_serde: "cachegen"
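If you prefer to create the file from the shell, a heredoc works as well (same content as above, assuming a bash-compatible shell):
cat > lmcache_config.yaml <<'EOF'
chunk_size: 256
local_cpu: true
remote_url: "lm://localhost:65432"
remote_serde: "cachegen"
EOF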
Run centralized sharing example#
Start the LMCache centralized server:
lmcache_server localhost 65432
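Optionally, before launching the vLLM instances, you can check that the cache server is accepting connections; for example, with netcat installed:
# Exits with status 0 once something is listening on port 65432
nc -z localhost 65432 && echo "LMCache server is up"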
In a different terminal, start the first vLLM instance:
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
CUDA_VISIBLE_DEVICES=0 \
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8000 --kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
In another terminal, start the second vLLM instance:
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
CUDA_VISIBLE_DEVICES=1 \
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8001 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Wait until both engines are ready.
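If you want to wait programmatically, one option is to poll each engine's /v1/models endpoint (part of vLLM's OpenAI-compatible API) until it responds; a minimal sketch:
# Block until both OpenAI-compatible servers answer
for port in 8000 8001; do
  until curl -sf http://localhost:${port}/v1/models > /dev/null; do
    sleep 5
  done
  echo "Engine on port ${port} is ready"
done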
Send one request to the engine at port 8000:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
Send the same request to the engine at port 8001:
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
The second request will automatically retrieve and reuse the KV cache from the first instance, significantly reducing generation time.
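To quantify the speedup, you can time the same request against both instances using curl's %{time_total} variable. The sketch below uses a fresh, hypothetical prompt so that the call to port 8000 is a cache miss and the call to port 8001 reuses the stored KV cache; exact timings depend on your hardware:
for port in 8000 8001; do
  curl -s -o /dev/null \
    -w "port ${port}: %{time_total}s total\n" \
    -X POST http://localhost:${port}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mistral-7B-Instruct-v0.2",
      "prompt": "Describe how prefix caching helps multi-turn conversations.",
      "max_tokens": 10
    }'
done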
P2P KV cache sharing#
This section demonstrates how to share KV cache across multiple vLLM instances using peer-to-peer transfer.
Setup P2P sharing#
Create two configuration files for the P2P sharing setup:
Instance 1 configuration (lmcache_config1.yaml):
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5
# P2P configuration
enable_p2p: true
lookup_url: "localhost:8100"
distributed_url: "localhost:8200"
Instance 2 configuration (lmcache_config2.yaml):
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5
# P2P configuration
enable_p2p: true
lookup_url: "localhost:8100"
distributed_url: "localhost:8201"
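The two files differ only in the distributed_url port. If you would rather generate both from the shell, a small loop like the following sketch does it (ports 8200 and 8201 respectively; assumes a bash-compatible shell):
for i in 1 2; do
  cat > "lmcache_config${i}.yaml" <<EOF
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5
# P2P configuration
enable_p2p: true
lookup_url: "localhost:8100"
distributed_url: "localhost:820$((i - 1))"
EOF
done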
Run P2P sharing example#
Pull the Redis Docker image and start the lookup server at port 8100:
docker pull redis
docker run --name lmcache-redis -d -p 8100:6379 redis
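You can confirm the lookup server is ready; since it is just Redis, a PONG reply means it is up:
docker exec lmcache-redis redis-cli ping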
Start two vLLM engines:
Start vLLM engine 1 at port 8000:
CUDA_VISIBLE_DEVICES=0 \
LMCACHE_CONFIG_FILE=lmcache_config1.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.8 \
--port 8000 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Start vLLM engine 2 at port 8001:
CUDA_VISIBLE_DEVICES=1 \
LMCACHE_CONFIG_FILE=lmcache_config2.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.8 \
--port 8001 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Note that the two distributed cache servers will start at ports 8200 and 8201.
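Once both engines have finished loading, you can optionally confirm that the distributed cache servers are listening (a sketch assuming ss is available):
ss -ltn | grep -E ':(8200|8201)\s'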
Send a request to vLLM engine 1:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 100
}'
Send the same request to vLLM engine 2:
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 100
}'
The KV cache will be retrieved automatically from vLLM engine 1 via P2P transfer, reducing the generation time of the second request.
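If you want timing output for these requests as well, the same curl -w approach from the centralized example applies; for engine 2, for example:
curl -s -o /dev/null -w "%{time_total}s total\n" \
  -X POST http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt": "Explain the significance of KV cache in language models.", "max_tokens": 100}'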