One Prefiller, One Decoder (1p1d) Example#
This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node with a 1 prefiller + 1 decoder setup. This configuration separates the compute-intensive prefill phase from the token-generation decode phase, running each on its own GPU for better resource utilization and performance.
Architecture Overview#
The 1p1d setup consists of three main components:
Prefiller Server - Handles the prefill phase of inference (initial prompt processing)
Decoder Server - Handles the decode phase of inference (token generation)
Proxy Server - Coordinates requests between the prefiller and decoder
┌─────────────┐
│ Client │
└─────┬───────┘
│
┌───────▼───────┐
│ Proxy Server │
│ Port 9000 │
└───┬───────┬───┘
│ │
┌────────▼──┐ ┌─▼────────┐
│ Prefiller │ │ Decoder │
│Port 8100 │ │Port 8200 │
│ GPU 0 │ │ GPU 1 │
└───────────┘ └──────────┘
│ ▲
│ │
└───────┘
NIXL Transfer
Prerequisites#
LMCache: Install with pip install lmcache
NIXL: Install from the NIXL GitHub repository
Hardware: At least 2 GPUs
Model Access: Valid Hugging Face token (HF_TOKEN) for Llama 3.1 8B Instruct
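A quick sanity check before launching can save a failed start. The commands below are a hedged sketch of such a check (they assume standard NVIDIA and Python tooling; adjust to your environment):

```bash
# Expect at least two GPUs to be listed
nvidia-smi --list-gpus

# Confirm LMCache is importable in the current environment
python -c "import lmcache"

# Fail fast if the Hugging Face token is not exported
echo "${HF_TOKEN:?HF_TOKEN is not set}" > /dev/null
```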
Quick Start#
Set your Hugging Face token:
export HF_TOKEN=hf_your_token_here
Navigate to the example directory:
cd examples/disagg_prefill/1p1d
Run the example:
bash disagg_example_nixl.sh
The script will automatically:
Launch a prefiller instance on port 8100 (GPU 0)
Launch a decoder instance on port 8200 (GPU 1)
Launch a proxy server on port 9000
Wait for all servers to be ready
Press Ctrl+C to stop all servers.
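The "wait for all servers to be ready" step is a polling loop against each server. Below is a minimal sketch of such a check, assuming vLLM's /health endpoint; the bundled script may implement this differently:

```bash
# Poll a vLLM server until its health endpoint responds
wait_for_server() {
  local port=$1
  until curl -sf "http://localhost:${port}/health" > /dev/null; do
    sleep 1
  done
  echo "Server on port ${port} is ready"
}

wait_for_server 8100   # prefiller
wait_for_server 8200   # decoder
```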
Configuration#
Prefiller Configuration#
The prefiller is configured via configs/lmcache-prefiller-config.yaml:
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL
enable_nixl: True
nixl_role: "sender"
nixl_peer_host: "localhost"
nixl_peer_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
Key settings:
- nixl_role: "sender" - Configures this instance to send KV cache data
- nixl_buffer_size: 1GB - Buffer size for NIXL transfers
- nixl_buffer_device: "cuda" - Uses GPU memory for buffering
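nixl_buffer_size is given in bytes; 1073741824 is 1 GiB (1024^3). If you want a differently sized buffer, compute the byte count the same way, for example:

```bash
# 1 GiB in bytes, matching the value in the config above
echo $((1024 ** 3))        # 1073741824
# Example: a 2 GiB buffer
echo $((2 * 1024 ** 3))    # 2147483648
```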
Decoder Configuration#
The decoder is configured via configs/lmcache-decoder-config.yaml:
local_cpu: False
max_local_cpu_size: 0
max_local_disk_size: 0
remote_serde: NULL
enable_nixl: True
nixl_role: "receiver"
nixl_peer_host: "localhost"
nixl_peer_port: 55555
nixl_buffer_size: 1073741824 # 1GB
nixl_buffer_device: "cuda"
nixl_enable_gc: True
Key settings:
- nixl_role: "receiver" - Configures this instance to receive KV cache data
- Same buffer configuration as the prefiller for compatibility
Components Deep Dive#
Proxy Server (disagg_proxy_server.py)#
The proxy server coordinates the disaggregated prefill workflow:
Request Handling: Receives client requests on port 9000
Prefill Coordination: Sends requests to the prefiller with max_tokens=1
Response Streaming: Streams the full response from the decoder
Performance Monitoring: Tracks Time-To-First-Token (TTFT) statistics
Supported endpoints:
- /v1/completions
- /v1/chat/completions
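The same two-phase flow can be reproduced by hand against the individual servers, which is useful when debugging the proxy. The sketch below mirrors the documented behavior (prefill with max_tokens=1, then decode); it is only an illustration, not a substitute for disagg_proxy_server.py:

```bash
# Phase 1: prefill only - the proxy sends the request to the prefiller
# with max_tokens=1 so the KV cache is produced and transferred via NIXL
curl -s http://localhost:8100/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "The future of AI is", "max_tokens": 1}'

# Phase 2: decode - the proxy then streams the full response from the decoder
curl -s http://localhost:8200/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "The future of AI is", "max_tokens": 50, "stream": true}'
```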
vLLM Server Launcher (disagg_vllm_launcher.sh)#
This script launches individual vLLM servers with appropriate configurations:
Prefiller Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=configs/lmcache-prefiller-config.yaml \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=0 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8100 \
--disable-log-requests \
--enforce-eager \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_producer",...}'
Decoder Launch Command:
UCX_TLS=cuda_ipc,cuda_copy,tcp \
LMCACHE_CONFIG_FILE=configs/lmcache-decoder-config.yaml \
VLLM_ENABLE_V1_MULTIPROCESSING=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
CUDA_VISIBLE_DEVICES=1 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8200 \
--disable-log-requests \
--enforce-eager \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_consumer",...}'
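Before testing through the proxy, you can confirm that each instance has loaded the model by querying the standard OpenAI-compatible model list endpoint that vLLM serves:

```bash
# Each response should list meta-llama/Llama-3.1-8B-Instruct
curl -s http://localhost:8100/v1/models   # prefiller
curl -s http://localhost:8200/v1/models   # decoder
```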
Testing and Benchmarking#
Basic Test#
Once all servers are running, you can test with a simple curl command:
curl -X POST http://localhost:9000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "The future of AI is",
"max_tokens": 50,
"temperature": 0.7
}'
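The proxy's /v1/chat/completions endpoint can be exercised the same way:

```bash
curl -X POST http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "The future of AI is"}],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```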
Performance Benchmarking#
For comprehensive performance testing, use vLLM’s benchmark tool:
python benchmark_serving.py --port 9000 --seed $(date +%s) \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random --random-input-len 7500 --random-output-len 200 \
--num-prompts 30 --burstiness 100 --request-rate 1 --ignore-eos
This benchmark:
- Sends requests to port 9000 (proxy server)
- Uses random prompts with 7500 input tokens
- Generates 200 output tokens per request
- Tests with 30 total prompts at 1 request/second
Log Files and Monitoring#
The example generates three log files for monitoring:
prefiller.log - Prefiller server logs and errors
decoder.log - Decoder server logs and errors
proxy.log - Proxy server logs and TTFT statistics
The proxy server automatically calculates and displays TTFT statistics every 5 seconds:
===============================
Num requests: 10
Prefill node TTFT stats:
- Average (ms): 45.2
- Median (ms): 43.1
- 99th Percentile (ms): 52.8
===============================
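If you want to recompute these statistics yourself, for example from TTFT values you have collected elsewhere, the sketch below computes the same three metrics from a hypothetical file ttfts.txt containing one value in milliseconds per line; the proxy's own implementation may differ:

```bash
# Average, median, and p99 over sorted TTFT samples
sort -n ttfts.txt | awk '
  { v[NR] = $1; sum += $1 }
  END {
    printf "Average (ms): %.1f\n", sum / NR
    printf "Median (ms): %.1f\n", v[int((NR + 1) / 2)]
    idx = int(0.99 * NR); if (idx < 1) idx = 1
    printf "99th Percentile (ms): %.1f\n", v[idx]
  }'
```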
Troubleshooting#
Common Issues#
GPU Memory: Ensure each GPU has sufficient memory for the model
NIXL Installation: Verify NIXL is properly installed and accessible
Port Conflicts: Check that ports 8100, 8200, and 9000 are available
HF Token: Ensure your Hugging Face token has access to Llama models
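These checks map to a few quick commands (hedged sketch: huggingface-cli is only available if huggingface_hub is installed):

```bash
# GPU memory and utilization
nvidia-smi

# Ports 8100, 8200, and 9000 should not already be in use
ss -ltn | grep -E '8100|8200|9000' || echo "ports 8100/8200/9000 appear free"

# Confirm the token is set and the account can access gated models
echo "${HF_TOKEN:+HF_TOKEN is set}"
huggingface-cli whoami
```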
Error Recovery#
If any server fails to start:
Check the corresponding log file for error details
Verify GPU availability with nvidia-smi
Ensure all dependencies are installed
Try restarting with Ctrl+C followed by re-running the script