P2P KV Cache Sharing#
This example demonstrates peer-to-peer (P2P) KV cache sharing between two vLLM engines using LMCache.
Prerequisites#
Your server should have at least two GPUs; a quick way to verify the GPUs and ports is shown after this list.
This example uses the following ports:
8000 and 8001 for the two vLLM engines
8200 and 8201 for the two distributed cache servers
8100 for the lookup server
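Before starting, you can sanity-check the GPU count and confirm the ports are free. This check is not part of the original example; lsof is one option if it is installed:
nvidia-smi -L                                              # should list at least two GPUs
for p in 8000 8001 8100 8200 8201; do lsof -i :$p; done    # no output means the port is free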
Steps#
Pull the Redis Docker Image and Start the Lookup Server at Port 8100
docker pull redis
docker run --name some-redis -d -p 8100:6379 redis
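To confirm the lookup server is up, you can ping Redis through the container (an optional sanity check):
docker exec some-redis redis-cli ping    # expected output: PONG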
Start Two vLLM Engines
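Both engines read their LMCache settings from example1.yaml and example2.yaml, passed via LMCACHE_CONFIG_FILE below. A minimal sketch of what example1.yaml might contain is shown here; the key names (chunk_size, local_device, enable_p2p, lookup_url, distributed_url) are based on LMCache's experimental config format and should be verified against your LMCache version:
chunk_size: 256
local_device: "cpu"
# P2P sharing settings (assumed key names; verify against your LMCache version)
enable_p2p: True
lookup_url: "localhost:8100"
distributed_url: "localhost:8200"
example2.yaml is identical except that distributed_url should point to port 8201, i.e. "localhost:8201".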
Start vLLM engine 1 at port 8000:
CUDA_VISIBLE_DEVICES=0 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=example1.yaml vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.8 --port 8000 --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}'
Start vLLM engine 2 at port 8001:
CUDA_VISIBLE_DEVICES=1 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=example2.yaml vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.8 --port 8001 --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}'
Note that the two distributed cache servers will start at ports 8200 and 8201.
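Once both engines finish loading the model, you can confirm they are serving requests; /v1/models is the standard OpenAI-compatible endpoint exposed by vllm serve:
curl http://localhost:8000/v1/models
curl http://localhost:8001/v1/models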
Send Request to vLLM Engine 1
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
Send Request to vLLM Engine 2
curl -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
Because the prompt is identical, the KV cache computed by vLLM engine 1 is automatically located through the lookup server and transferred to engine 2, so the prefill is not recomputed.
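To make the effect visible, you can prefix each curl above with the shell's time builtin and compare: the first request (engine 1, cold cache) must compute the prefill, while the request to engine 2 retrieves it over P2P. A longer prompt makes the gap easier to see. This measurement is an illustration, not part of the original example:
time curl -s -X POST http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}' > /dev/null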