Example: Share KV cache across multiple LLMs
LMCache reduces the generation time of the second and subsequent calls by letting them reuse the KV cache computed during the first call.
We have examples for the following types of cross-instance KV cache sharing:
KV cache sharing through a centralized cache server: centralized_sharing
KV cache sharing through P2P cache transfer: p2p_sharing
Prerequisites
Your server should have at least 2 GPUs (a quick check is shown below).
Centralized sharing uses ports 8000 and 8001 for the two vLLM instances and port 65432 for the centralized LMCache server.
P2P sharing uses ports 8000 and 8001 for the two vLLM instances, ports 8200 and 8201 for the two distributed cache servers, and port 8100 for the lookup server.
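Before starting, you can confirm the GPU count (this assumes the NVIDIA driver and nvidia-smi are available):
# Prints one line per visible GPU; you need at least two
nvidia-smi -L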
Centralized KV cache sharing
This section demonstrates how to share KV cache across multiple vLLM instances using a centralized LMCache server.
Setup centralized sharing
First, create a configuration file named lmcache_config.yaml with the following content:
# Number of tokens per KV cache chunk
chunk_size: 256
# Also keep KV cache in local CPU memory
local_cpu: true
# Address of the centralized LMCache server started below
remote_url: "lm://localhost:65432"
# Serialization format for KV cache sent to the remote server
remote_serde: "cachegen"
# Whether retrieve() is pipelined or not
pipelined_backend: false
Run centralized sharing example
Start the LMCache centralized server:
lmcache_server localhost 65432
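Optionally, verify that the server is accepting connections (a quick sketch assuming netcat is installed):
# Exits successfully if something is listening on port 65432
nc -z localhost 65432 && echo "LMCache server is up"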
In a different terminal, start the first vLLM instance on GPU 0:
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
CUDA_VISIBLE_DEVICES=0 \
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8000 --kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
In another terminal, start the second vLLM instance on GPU 1:
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
CUDA_VISIBLE_DEVICES=1 \
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
--gpu-memory-utilization 0.8 \
--port 8001 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Wait until both engines are ready.
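The engines take a while to load model weights. One way to wait for both is a minimal sketch that polls vLLM's /health endpoint with curl:
# Loop until each engine returns HTTP 200 on /health
for port in 8000 8001; do
  until curl -sf http://localhost:${port}/health > /dev/null; do
    sleep 5
  done
  echo "engine on port ${port} is ready"
done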
Send one request to the engine at port 8000:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
Send the same request to the engine at port 8001:
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
The second request automatically retrieves and reuses the KV cache that the first instance stored on the centralized server, significantly reducing generation time.
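To quantify the speedup, curl's built-in timer gives a rough comparison. The sketch below uses a fresh prompt (any text not yet cached works): the first request is a cold miss on engine 1, while the second reuses the KV cache stored by engine 1 and should return faster:
# Send the same uncached prompt to both engines and print total request time
for port in 8000 8001; do
  curl -s -o /dev/null -w "port ${port}: %{time_total}s\n" \
    -X POST http://localhost:${port}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mistral-7B-Instruct-v0.2",
      "prompt": "Describe how paged attention manages GPU memory in inference engines.",
      "max_tokens": 10
    }'
done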
P2P KV cache sharing
This section demonstrates how to share KV cache across multiple vLLM instances using peer-to-peer transfer.
Setup P2P sharing
Create two configuration files for the P2P sharing setup:
Instance 1 configuration (lmcache_config1.yaml):
chunk_size: 256
local_cpu: true
# Maximum CPU cache size, in GB
max_local_cpu_size: 5
# P2P configuration
enable_p2p: true
# Redis lookup server that tracks which peer holds which KV cache
lookup_url: "localhost:8100"
# Port this instance's distributed cache server listens on
distributed_url: "localhost:8200"
Instance 2 configuration (lmcache_config2.yaml):
chunk_size: 256
local_cpu: true
# Maximum CPU cache size, in GB
max_local_cpu_size: 5
# P2P configuration
enable_p2p: true
# Redis lookup server that tracks which peer holds which KV cache
lookup_url: "localhost:8100"
# Port this instance's distributed cache server listens on
distributed_url: "localhost:8201"
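The two files are intentionally identical except for distributed_url, since each instance binds its own cache server port; a quick diff makes that easy to verify:
# The only difference should be the distributed_url line (8200 vs 8201)
diff lmcache_config1.yaml lmcache_config2.yaml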
Run P2P sharing example
Pull the Redis Docker image and start the lookup server on port 8100:
docker pull redis
docker run --name lmcache-redis -d -p 8100:6379 redis
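To verify the lookup server is up, you can ping Redis through the container (redis-cli ships in the official image):
# Should print PONG
docker exec lmcache-redis redis-cli ping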
Start two vLLM engines.
Start vLLM engine 1 at port 8000:
CUDA_VISIBLE_DEVICES=0 \
LMCACHE_CONFIG_FILE=lmcache_config1.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.8 \
--port 8000 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Start vLLM engine 2 at port 8001:
CUDA_VISIBLE_DEVICES=1 \
LMCACHE_CONFIG_FILE=lmcache_config2.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.8 \
--port 8001 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Note that the two distributed cache servers will start at ports 8200 and 8201; a quick check is shown below.
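Once both engines are up, you can confirm the cache servers are listening (a sketch assuming netcat is installed):
# Checks that the distributed cache servers are accepting connections
for port in 8200 8201; do
  nc -z localhost ${port} && echo "distributed cache server on port ${port} is up"
done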
Send a request to vLLM engine 1:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 100
}'
Send the same request to vLLM engine 2:
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 100
}'
The KV cache for the shared prompt is retrieved automatically from vLLM engine 1 over P2P, so the second request completes faster; you can compare request times with the same curl timing approach used in the centralized example.