X Prefiller, Y Decoder (XpYd) Example#
This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node with multiple prefiller and decoder instances. This configuration allows for horizontal scaling of both the compute-intensive prefill operations and the decode operations, enabling better resource utilization and higher throughput.
Architecture Overview#
The XpYd setup consists of multiple components that can be scaled independently:
Multiple Prefiller Servers - Handle the prefill phase of inference (initial prompt processing)
Multiple Decoder Servers - Handle the decode phase of inference (token generation)
Proxy Server - Coordinates requests between prefillers and decoders using round-robin load balancing
Example 2p2d Architecture:
┌─────────────┐
│ Client │
└─────┬───────┘
│
┌───────▼────────┐
│ Proxy Server │
│ Port 9100 │──────────────────────|
│ (Round-Robin) │ |
└───▲────────▲───┘ |
│ │ |
┌────────▼───┐ ┌─▼──────────┐ |
│ Prefiller1 │ │ Prefiller2 │ |
│ Port 7100 │ │ Port 7101 │ |
│ GPU 0 │ │ GPU 1 │ |
└─────▲──────┘ └─────▲──────┘ |
│ │ |
│ NIXL transfer | |
│ │ |
┌────▼───────┐ ┌────▼──────┐ |
│ Decoder 1 │ │ Decoder 2 │ |
│ Port 7200 │ │ Port 7201 │ |
│ GPU 2 │ │ GPU 3 │ |
└────▲───────┘ └────▲──────┘ |
│ │ |
└───────────────┴──────────────────────|
Prerequisites#
LMCache: Install with
pip install lmcache
NIXL: Install from NIXL GitHub repository
Hardware: At least 4 GPUs (2 for prefillers + 2 for decoders in 2p2d setup)
Model Access: Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct
Quick Start (2p2d Example)#
Set your Hugging Face token:
export HF_TOKEN=hf_your_token_here
Navigate to the example directory:
cd examples/disagg_prefill/xpyd_experimental
Run the example:
bash disagg_example_xpyd.sh
The script will automatically:
Launch two decoder instances on ports 7200 and 7201 (GPU 2 and GPU 3)
Launch two prefiller instances on ports 7100 and 7101 (GPU 0 and GPU 1)
Launch a proxy server on port 9100 with round-robin load balancing
Wait for all servers to be ready
Press Ctrl+C to stop all servers.
Configuration#
Important: For correct KV cache transfer, ensure all processes use the same PYTHONHASHSEED to keep the hash of the KV cache consistent across processes:
export PYTHONHASHSEED=0
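The snippet below (an illustrative check, not part of the example scripts) shows why this matters: Python's built-in hash() for strings is salted per process unless PYTHONHASHSEED is fixed, so hash-derived cache keys would otherwise differ between the prefiller and decoder processes.
# check_hashseed.py - illustrative only; run it to see that string hashes
# differ across processes unless PYTHONHASHSEED is pinned.
import os
import subprocess
import sys

code = 'print(hash("kv-cache-chunk-0"))'
env_unset = {k: v for k, v in os.environ.items() if k != "PYTHONHASHSEED"}
env_fixed = dict(os.environ, PYTHONHASHSEED="0")

for label, env in [("unset", env_unset), ("PYTHONHASHSEED=0", env_fixed)]:
    runs = [
        subprocess.run([sys.executable, "-c", code], env=env,
                       capture_output=True, text=True).stdout.strip()
        for _ in range(2)
    ]
    print(f"{label}: {runs[0]} vs {runs[1]}")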
Prefiller Configuration#
All prefillers share the same configuration via configs/lmcache-prefiller-config.yaml:
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
enable_nixl: True
enable_xpyd: True
nixl_role: "sender"
nixl_proxy_host: "localhost"
nixl_proxy_port: 7500
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
Key settings:
- nixl_role: "sender" - Configures these instances to send KV cache data
- nixl_buffer_size: 1073741824 (1GB) - Buffer size for NIXL transfers
- nixl_buffer_device: "cuda" - Uses GPU memory for buffering
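As a quick sanity check before launching, the config can be loaded and the sender role verified. This is an illustrative Python sketch (not part of the example scripts) and assumes PyYAML is installed:
# validate_config.py - illustrative sketch; checks the fields this example
# relies on in the prefiller config.
import yaml  # pip install pyyaml

with open("configs/lmcache-prefiller-config.yaml") as f:
    cfg = yaml.safe_load(f)

assert cfg["nixl_role"] == "sender", "prefillers must be NIXL senders"
assert cfg["enable_nixl"] and cfg["enable_xpyd"]
# 1073741824 bytes == 1 GiB of GPU-side staging buffer
print(f"NIXL buffer: {cfg['nixl_buffer_size'] / 2**30:.0f} GiB "
      f"on {cfg['nixl_buffer_device']}")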
Decoder Configuration#
The decoder(s) are configured via configs/lmcache-decoder-x-config.yaml, where x is the decoder index (one config file per decoder):
local_cpu: False
max_local_cpu_size: 0
enable_nixl: True
enable_xpyd: True
nixl_role: "receiver"
nixl_peer_host: "localhost"
nixl_peer_init_port: 730x
nixl_peer_alloc_port: 740x
nixl_buffer_size: 2147483648 # 2GB
nixl_buffer_device: "cuda"
Key settings:
- nixl_role: "receiver" - Configures these instances to receive KV cache data
- nixl_buffer_size: 2147483648 (2GB) - Buffer size for NIXL transfers
- nixl_buffer_device: "cuda" - Uses GPU memory for buffering
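Because the shipped file uses x as a placeholder, one way to materialize a config per decoder is a small generator like the hypothetical helper below. The concrete port expansion (7301/7401 for decoder 1) is an assumption about how x maps to the decoder index; adjust it to match your setup:
# gen_decoder_configs.py - hypothetical helper, not part of the example.
# Expands the "x" placeholder into one config file per decoder, assuming
# x is the 1-based decoder index (so decoder 1 gets ports 7301 and 7401).
NUM_DECODERS = 2

TEMPLATE = """\
local_cpu: False
max_local_cpu_size: 0
enable_nixl: True
enable_xpyd: True
nixl_role: "receiver"
nixl_peer_host: "localhost"
nixl_peer_init_port: {init_port}
nixl_peer_alloc_port: {alloc_port}
nixl_buffer_size: 2147483648  # 2GB
nixl_buffer_device: "cuda"
"""

for i in range(1, NUM_DECODERS + 1):
    path = f"configs/lmcache-decoder-{i}-config.yaml"
    with open(path, "w") as f:
        f.write(TEMPLATE.format(init_port=7300 + i, alloc_port=7400 + i))
    print(f"wrote {path}")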
Components Deep Dive#
Proxy Server (disagg_proxy_server.py)#
The proxy server coordinates the multi-prefiller disaggregated workflow:
Request Handling: Receives client requests on port 9100
Load Balancing: Distributes requests across multiple prefillers using round-robin
Prefill Coordination: Sends requests to prefillers with max_tokens=1
Prefill Response: Waits for the prefiller to signal that the NIXL transfer is complete
Response Streaming: Streams the full response from the decoder
Performance Monitoring: Tracks Time-To-First-Token (TTFT) statistics
Key features:
- Round-robin distribution: Balances load across --num-prefillers instances (see the sketch below)
- Fault tolerance: Handles prefiller failures gracefully
- Monitoring: Provides detailed TTFT statistics for each prefiller
Supported endpoints:
- /v1/completions
- /v1/chat/completions
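The round-robin selection itself is simple. Here is a minimal sketch of the idea (not the actual disagg_proxy_server.py implementation):
# round_robin.py - minimal sketch of the proxy's load-balancing idea.
import itertools

class RoundRobinBalancer:
    """Cycles through prefiller base URLs, one per incoming request."""

    def __init__(self, prefiller_urls):
        self._cycle = itertools.cycle(prefiller_urls)

    def next_prefiller(self):
        return next(self._cycle)

balancer = RoundRobinBalancer([
    "http://localhost:7100",  # Prefiller 1 (GPU 0)
    "http://localhost:7101",  # Prefiller 2 (GPU 1)
])

for request_id in range(4):
    # Requests alternate: 7100, 7101, 7100, 7101, ...
    print(request_id, "->", balancer.next_prefiller())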
vLLM Server Launcher (disagg_vllm_launcher.sh)#
This script launches individual vLLM servers with appropriate configurations:
Prefiller1 Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=$prefill_config_file \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=0 \
vllm serve $MODEL \
--port 7100 \
--disable-log-requests \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer1"}}'
Prefiller2 Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=$prefill_config_file \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=1 \
vllm serve $MODEL \
--port 7101 \
--disable-log-requests \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "producer2"}}'
Decoder1 Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=$decode_config_file \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=2 \
vllm serve $MODEL \
--port 7200 \
--disable-log-requests \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer1", "skip_last_n_tokens": 1}}'
Decoder2 Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=$decode_config_file \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=3 \
vllm serve $MODEL \
--port 7201 \
--disable-log-requests \
--enforce-eager \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer","kv_connector_extra_config": {"discard_partial_chunks": false, "lmcache_rpc_port": "consumer2", "skip_last_n_tokens": 1}}'
Key differences from 1p1d (see the launcher sketch after this list):
- Each prefiller gets a unique lmcache_rpc_port (producer1, producer2, etc.)
- Each prefiller runs on a different GPU (CUDA_VISIBLE_DEVICES)
- Different ports for each prefiller (7100, 7101, etc.)
- Different ports for each decoder (7200, 7201, etc.)
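These per-instance differences are purely mechanical, so a launcher can derive them from the instance index. The sketch below is a hypothetical helper (not the shipped disagg_vllm_launcher.sh) that prints the prefiller launch commands for any number of instances; MODEL and CONFIG are placeholders to adjust:
# print_prefiller_cmds.py - hypothetical helper mirroring the pattern in
# disagg_vllm_launcher.sh; adjust MODEL and CONFIG for your setup.
import json

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
CONFIG = "configs/lmcache-prefiller-config.yaml"
NUM_PREFILLERS = 2

for i in range(NUM_PREFILLERS):
    kv_config = {
        "kv_connector": "LMCacheConnectorV1",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "discard_partial_chunks": False,
            "lmcache_rpc_port": f"producer{i + 1}",  # unique per prefiller
        },
    }
    print(
        f"UCX_TLS=cuda_ipc,cuda_copy,tcp "
        f"LMCACHE_CONFIG_FILE={CONFIG} "
        f"VLLM_ENABLE_V1_MULTIPROCESSING=1 "
        f"VLLM_WORKER_MULTIPROC_METHOD=spawn "
        f"CUDA_VISIBLE_DEVICES={i} "  # one GPU per prefiller
        f"vllm serve {MODEL} --port {7100 + i} "  # ports 7100, 7101, ...
        f"--disable-log-requests --enforce-eager --no-enable-prefix-caching "
        f"--kv-transfer-config '{json.dumps(kv_config)}'"
    )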
Basic Test#
Once all servers are running, you can test with a simple curl command:
curl -s -N -X POST http://127.0.0.1:9100/v1/completions -H "Content-Type: application/json" -d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "What date is today?",
"max_tokens": 20,
"temperature": 0.0
}'
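The same request can also be sent from Python (a minimal sketch using only the standard library; any HTTP client would work):
# basic_test.py - same request as the curl command above, via the stdlib.
import json
import urllib.request

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "What date is today?",
    "max_tokens": 20,
    "temperature": 0.0,
}
req = urllib.request.Request(
    "http://127.0.0.1:9100/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])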
Performance Benchmarking#
For comprehensive performance testing, use vLLM’s benchmark tool:
python benchmark_serving.py --port 9100 --seed $(date +%s) \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --random-input-len 7500 --random-output-len 200 \
--num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
Expected performance improvements with 2p2d:
- Higher throughput: Multiple prefillers can handle more concurrent requests
- Better TTFT: Load balancing reduces queuing delays
- Improved utilization: Better GPU utilization across multiple devices
Sample benchmark results:
============ Serving Benchmark Result ============
Successful requests: 30
Benchmark duration (s): 31.34
Total input tokens: 224970
Total generated tokens: 6000
Request throughput (req/s): 0.96
Output token throughput (tok/s): 191.44
Total Token throughput (tok/s): 7369.36
---------------Time to First Token----------------
Mean TTFT (ms): 313.41
Median TTFT (ms): 272.83
P99 TTFT (ms): 837.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.84
Median TPOT (ms): 8.72
P99 TPOT (ms): 11.35
---------------Inter-token Latency----------------
Mean ITL (ms): 8.84
Median ITL (ms): 8.61
P99 ITL (ms): 11.43
==================================================
Log Files and Monitoring#
The example generates multiple log files for comprehensive monitoring:
- prefiller1.log - First prefiller server logs and errors
- prefiller2.log - Second prefiller server logs and errors
- decoder1.log - First decoder server logs and errors
- decoder2.log - Second decoder server logs and errors
- proxy.log - Proxy server logs and TTFT statistics
The proxy server provides detailed statistics for each prefiller:
===============================
Num requests: 20
Prefiller 1 TTFT stats:
- Average (ms): 42.3
- Median (ms): 40.1
- 99th Percentile (ms): 48.7
Prefiller 2 TTFT stats:
- Average (ms): 43.8
- Median (ms): 41.5
- 99th Percentile (ms): 52.1
===============================
This helps identify performance differences between prefiller instances and optimize load balancing.
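For reference, statistics of this shape can be computed from raw TTFT samples with Python's standard library. This is an illustrative snippet with made-up sample values, not the proxy's actual code:
# ttft_stats.py - illustrative only; computes the same summary statistics
# the proxy reports, from per-prefiller TTFT samples (in milliseconds).
import statistics

def summarize(ttfts_ms):
    q = statistics.quantiles(ttfts_ms, n=100)  # q[98] is the 99th percentile
    return {
        "Average (ms)": statistics.fmean(ttfts_ms),
        "Median (ms)": statistics.median(ttfts_ms),
        "99th Percentile (ms)": q[98],
    }

samples = {1: [40.1, 42.3, 44.0, 48.7], 2: [41.5, 43.8, 45.2, 52.1]}  # made up
for prefiller, ttfts in samples.items():
    print(f"Prefiller {prefiller} TTFT stats:")
    for name, value in summarize(ttfts).items():
        print(f"  - {name}: {value:.1f}")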
Troubleshooting#
Common Issues#
GPU Memory: Ensure each GPU has sufficient memory for the model
NIXL Installation: Verify NIXL is properly installed and accessible
Port Conflicts: Check that all required ports are available
HF Token: Ensure your Hugging Face token has access to Llama models
GPU Assignment: Verify CUDA_VISIBLE_DEVICES assignments don’t conflict
Multi-Instance Specific Issues#
Uneven Load: Monitor prefiller statistics to ensure balanced distribution
Resource Contention: Watch for GPU memory pressure with multiple instances
Network Bottlenecks: Monitor NIXL transfer performance between instances
Startup Timing: Stagger prefiller launches to avoid resource conflicts (one approach is sketched below)
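One simple way to stagger launches is to poll each server's health endpoint and only start the next instance once the previous one is up. The sketch below assumes the vLLM OpenAI-compatible server's GET /health route, which returns 200 once the server is ready:
# wait_for_ready.py - sketch for staggering launches; assumes the vLLM
# OpenAI-compatible server exposes GET /health (returns 200 when ready).
import time
import urllib.request

def wait_until_ready(port, timeout_s=300):
    deadline = time.time() + timeout_s
    url = f"http://localhost:{port}/health"
    while time.time() < deadline:
        try:
            if urllib.request.urlopen(url, timeout=5).status == 200:
                return True
        except OSError:  # connection refused, timeout, etc. - retry
            pass
        time.sleep(2)
    return False

# Bring prefillers up one at a time instead of all at once.
for port in (7100, 7101):
    assert wait_until_ready(port), f"server on port {port} never became ready"
    print(f"port {port} ready")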