P2P KV Cache Sharing#
P2P (peer-to-peer) KV cache sharing enables direct cache transfer between multiple serving engine instances without requiring a centralized cache server. Because caches move directly between peers, this approach reduces transfer latency and scales with the number of instances, which is especially beneficial in distributed inference scenarios.
LMCache supports P2P sharing through a controller-based architecture using NIXL (NVIDIA Inference Xfer Library) for optimized data transfer between instances.
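In this guide, a controller coordinates two vLLM engines that exchange KV caches directly over NIXL. The topology, using the port numbers chosen in the example configurations below (any free ports work), looks like this:
+----------------------------+
|     LMCache Controller     |
|  pull :8300   reply :8400  |
+------+--------------+------+
       |              |
+------+---------+  +-+--------------+
| vLLM engine 1  |  | vLLM engine 2  |
| GPU 0, :8010   |  | GPU 1, :8011   |
| worker :8500   |  | worker :8501   |
| p2p :8200/8201 |<->| p2p :8202/8203|
+----------------+  +----------------+
     NIXL direct KV cache transfer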
Prerequisites#
Multi-GPU Setup: Your server should have at least two GPUs, one for each engine instance (see the quick check after this list)
NIXL: Install NIXL by following the NIXL installation instructions
LMCache: Install LMCache by following the Installation guide
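A quick way to confirm the GPU prerequisite:
# List visible GPUs; the output should contain at least two entries (GPU 0 and GPU 1)
nvidia-smi -L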
Configuration#
Create two configuration files for the P2P sharing setup.
The only differences between the two configurations are the lmcache_instance_id, p2p_init_ports, p2p_lookup_ports, and lmcache_worker_ports values.
Instance 1 Configuration (example1.yaml):
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5
enable_async_loading: True
# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8200
p2p_lookup_ports: 8201
transfer_channel: "nixl"
# Controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_1"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8500
extra_config:
  lookup_backoff_time: 0.001
Instance 2 Configuration (example2.yaml):
chunk_size: 256
local_cpu: True
max_local_cpu_size: 5
enable_async_loading: True
# P2P configurations
enable_p2p: True
p2p_host: "localhost"
p2p_init_ports: 8202
p2p_lookup_ports: 8203
transfer_channel: "nixl"
# Controller configurations
enable_controller: True
lmcache_instance_id: "lmcache_instance_2"
controller_pull_url: "localhost:8300"
controller_reply_url: "localhost:8400"
lmcache_worker_ports: 8501
extra_config:
  lookup_backoff_time: 0.001
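Since the files differ only in those four fields, example2.yaml can also be generated from example1.yaml mechanically. A minimal sketch using sed:
sed -e 's/lmcache_instance_1/lmcache_instance_2/' \
    -e 's/p2p_init_ports: 8200/p2p_init_ports: 8202/' \
    -e 's/p2p_lookup_ports: 8201/p2p_lookup_ports: 8203/' \
    -e 's/lmcache_worker_ports: 8500/lmcache_worker_ports: 8501/' \
    example1.yaml > example2.yaml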
Setup and Usage#
Step 1: Start the LMCache Controller
PYTHONHASHSEED=123 lmcache_controller --host localhost --port 9000 --monitor-ports '{"pull": 8300, "reply": 8400}'
Make sure that ports 8300 and 8400 match the controller_pull_url and controller_reply_url values in the configuration files. Port 9000 is the controller's main port; it is arbitrary and can be changed.
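Optionally, verify that the controller is up before starting the engines. Assuming it binds its main port and both monitor ports as TCP sockets on this host, a quick check with ss (from iproute2):
# All three ports should appear in LISTEN state
ss -ltn | grep -E ':(9000|8300|8400)\b'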
Step 2: Start vLLM Engines with LMCache Workers
Start vLLM engine 1 at port 8010:
PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=example1.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.8 \
--port 8010 \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
Start vLLM engine 2 at port 8011:
PYTHONHASHSEED=123 UCX_TLS=rc CUDA_VISIBLE_DEVICES=1 LMCACHE_CONFIG_FILE=example2.yaml \
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.8 \
--port 8011 \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
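Model loading can take a while. vLLM's OpenAI-compatible server exposes a /health endpoint, so you can block until both engines are ready:
# Poll each engine until it returns HTTP 200 on /health
until curl -sf http://localhost:8010/health > /dev/null; do sleep 1; done
until curl -sf http://localhost:8011/health > /dev/null; do sleep 1; done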
Step 3: Test P2P Cache Sharing
Send a request to vLLM engine 1 to populate the cache:
curl -X POST http://localhost:8010/v1/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",
\"prompt\": \"$(printf 'Explain the significance of KV cache in language models.%.0s' {1..100})\",
\"max_tokens\": 10
}"
Send the same request to vLLM engine 2 to demonstrate cache retrieval from engine 1:
curl -X POST http://localhost:8011/v1/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",
\"prompt\": \"$(printf 'Explain the significance of KV cache in language models.%.0s' {1..100})\",
\"max_tokens\": 10
}"
Expected Output#
When the second request successfully retrieves cache from the first instance, you should see logs similar to:
(EngineCore_DP0 pid=2577584)[2025-09-21 00:00:11,706] LMCache INFO: Established connection to peer_init_url localhost:8200. The peer_lookup_url: localhost:8201 (p2p_backend.py:278:lmcache.v1.storage_backend.p2p_backend)
(EngineCore_DP0 pid=2577584)[2025-09-21 00:00:11,792] LMCache INFO: Retrieved 1002 out of 1002 out of total 1002 tokens. size: 0.1223 gb, cost 60.3595 ms, throughput: 2.0264 GB/s; (cache_engine.py:496:lmcache.v1.cache_engine)
These logs indicate successful P2P connection establishment and high-throughput cache retrieval.
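If you redirect each engine's output to a file when launching it (for example by appending 2>&1 | tee engine2.log to the second serve command), you can check for the hit non-interactively:
# Both the connection and the retrieval line should appear in engine 2's log
grep -E 'Established connection to peer_init_url|Retrieved .* tokens' engine2.log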