Deployment Guide#
This page covers deploying LMCache multiprocess mode in Docker and Kubernetes environments, along with production best practices.
Docker#
LMCache container:
docker run --runtime nvidia --gpus all \
--network host \
--ipc host \
lmcache/standalone:nightly \
/opt/venv/bin/python3 -m lmcache.v1.multiprocess.server \
--l1-size-gb 60 --eviction-policy LRU --max-workers 4 --port 6555
vLLM container:
docker run --runtime nvidia --gpus all \
--network host \
--ipc host \
lmcache/vllm-openai:latest-nightly \
Qwen/Qwen3-14B \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"lmcache.mp.port": 6555}}'
Required Docker flags:
--network host – allows the vLLM container to reach LMCache on localhost.
--ipc host – required for CUDA IPC shared-memory transfers between containers.
--runtime nvidia --gpus all – GPU access via the NVIDIA container runtime.
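The two docker run commands above can also be expressed as a Compose file. This is a sketch only: it assumes the NVIDIA runtime is configured in the Docker daemon, and the service names are illustrative.

```yaml
# Sketch: LMCache server + vLLM as Compose services, mirroring the
# docker run flags above. Adjust to your Compose version and runtime setup.
services:
  lmcache:
    image: lmcache/standalone:nightly
    network_mode: host
    ipc: host
    runtime: nvidia
    command: >
      /opt/venv/bin/python3 -m lmcache.v1.multiprocess.server
      --l1-size-gb 60 --eviction-policy LRU --max-workers 4 --port 6555
  vllm:
    image: lmcache/vllm-openai:latest-nightly
    network_mode: host
    ipc: host
    runtime: nvidia
    depends_on:
      - lmcache
    command: >
      Qwen/Qwen3-14B
      --kv-transfer-config
      '{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both","kv_connector_extra_config":{"lmcache.mp.port":6555}}'
```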
HTTP server variant:
For health-check support (useful with container orchestrators), use the HTTP server entry point:
docker run --runtime nvidia --gpus all \
--network host \
--ipc host \
lmcache/standalone:nightly \
/opt/venv/bin/python3 -m lmcache.v1.multiprocess.http_server \
--l1-size-gb 60 --eviction-policy LRU --max-workers 4 --port 6555
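Once the HTTP variant is up, you can verify it manually before wiring it into an orchestrator. A minimal polling sketch, assuming the /api/healthcheck endpoint described under Health Checking below is served on port 8000 (adjust the URL to your configuration):

```python
import urllib.request
import urllib.error

def is_healthy(url, timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout all count as unhealthy.
        return False

# Hypothetical local check; port 8000 matches the probe examples below.
print(is_healthy("http://localhost:8000/api/healthcheck"))
```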
Kubernetes#
LMCache is designed for a DaemonSet + Deployment pattern: one LMCache server per node (DaemonSet) shared by multiple vLLM pods (Deployment).
Example YAML files are provided in examples/multi_process/.
Prerequisites#
Kubernetes cluster with GPU support (NVIDIA GPU Operator installed)
At least 4 GPUs per node
kubectl configured to access your cluster
Step-by-Step#
Step 1: Create namespace
kubectl create namespace multi-process
Step 2: Deploy LMCache DaemonSet
kubectl apply -f examples/multi_process/lmcache-daemonset.yaml
Step 3: Deploy vLLM
kubectl apply -f examples/multi_process/vllm-deployment.yaml
Note
The default model is Qwen/Qwen3-14B. For gated models (e.g., Llama),
create a Secret with your Hugging Face token:
kubectl create secret generic vllm-secrets \
--from-literal=hf_token=your_hf_token_here \
-n multi-process
Then add the HF_TOKEN environment variable to the vLLM container spec.
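The container-spec addition can look like the following fragment, referencing the vllm-secrets Secret and hf_token key created above:

```yaml
# Add to the vLLM container spec in vllm-deployment.yaml.
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: vllm-secrets
        key: hf_token
```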
Step 4: Monitor deployment
# DaemonSet status
kubectl get daemonset -n multi-process
kubectl get pods -n multi-process -l app=lmcache-server
# vLLM status
kubectl get pods -n multi-process -l app=vllm-deployment -w
# LMCache logs (for a specific node)
VLLM_NODE=$(kubectl get pod -n multi-process -l app=vllm-deployment \
-o jsonpath='{.items[0].spec.nodeName}')
LMCACHE_POD=$(kubectl get pod -n multi-process -l app=lmcache-server \
--field-selector spec.nodeName=$VLLM_NODE \
-o jsonpath='{.items[0].metadata.name}')
kubectl logs -n multi-process $LMCACHE_POD -f
Step 5: Send test requests
kubectl port-forward -n multi-process deployment/vllm-deployment 8000:8000
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"Qwen/Qwen3-14B\",
\"prompt\": \"$(printf 'Explain the significance of KV cache in language models.%.0s' {1..100})\",
\"max_tokens\": 10
}"
Architecture Notes#
The DaemonSet uses ``hostNetwork: true`` so vLLM pods discover the LMCache server via ``status.hostIP``.
Both containers mount ``/dev/shm`` from the host to enable CUDA IPC memory sharing.
GPUs are NOT requested in the DaemonSet – this allows GPUs to remain exclusively allocated to vLLM pods. The NVIDIA container runtime automatically provides GPU access for IPC-based memory transfers.
Multiple vLLM pods on the same node automatically connect to the same LMCache DaemonSet instance.
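The host-IP discovery mentioned above is typically wired through the Kubernetes downward API. A sketch of the relevant vLLM pod-spec fragment; the environment variable name here is an assumption (check the example YAML for the actual name the connector reads):

```yaml
# Expose the node's IP to the vLLM container so it can reach the
# node-local LMCache DaemonSet pod over hostNetwork.
env:
  - name: LMCACHE_SERVER_HOST   # hypothetical variable name
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
```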
Note
LMCache pods on nodes without GPUs will crash with CUDA initialization errors. This is expected – LMCache only needs to run on GPU nodes where vLLM pods are scheduled.
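If you want to avoid these crash loops on CPU-only nodes, you can constrain the DaemonSet to GPU nodes. A sketch; the label shown is the one applied by NVIDIA GPU Feature Discovery, so adjust it to whatever labels your cluster actually carries:

```yaml
# Add to lmcache-daemonset.yaml to schedule only onto GPU nodes.
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
```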
Health Checking (HTTP Server)#
For Kubernetes liveness/readiness probes, deploy the HTTP server variant
instead. Use the /api/healthcheck endpoint:
livenessProbe:
httpGet:
path: /api/healthcheck
port: 8000
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /api/healthcheck
port: 8000
initialDelaySeconds: 5
periodSeconds: 10
Monitoring Integration#
Prometheus metrics are enabled by default on port 9090. Add a
ServiceMonitor or Prometheus scrape annotation to collect metrics from the
LMCache DaemonSet pods. See Observability for metric details.
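For annotation-based scraping, a pod-template fragment like the following is a common convention (it only works if your Prometheus scrape config honors these annotations; port 9090 matches the default noted above):

```yaml
# Pod-template annotations for conventional Prometheus discovery.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
```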
Cleanup#
kubectl delete -f examples/multi_process/vllm-deployment.yaml
kubectl delete -f examples/multi_process/lmcache-daemonset.yaml
kubectl delete namespace multi-process
Production Best Practices#
Worker count (``--max-workers``): Start with 1 (default). Increase to 2–4 if you see ZMQ request queuing with multiple vLLM pods.
L1 memory sizing (``--l1-size-gb``): Allocate as much CPU memory as available after accounting for the OS and vLLM. A larger L1 cache means fewer L2 round-trips.
Eviction tuning:
--eviction-trigger-watermark 0.8 (default) triggers eviction when L1 is 80% full.
--eviction-ratio 0.2 (default) frees 20% of allocated memory per eviction cycle.
Lower the watermark or increase the ratio if you observe frequent evictions under steady load.
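With the defaults above and the 60 GB L1 size used in this guide's examples, the thresholds work out as follows (a quick arithmetic sketch, not LMCache's internal implementation):

```python
def eviction_thresholds(l1_size_gb, watermark=0.8, ratio=0.2):
    """Compute when eviction triggers and how much each cycle frees."""
    trigger_gb = l1_size_gb * watermark  # eviction starts once L1 use passes this
    freed_gb = l1_size_gb * ratio        # memory freed per eviction cycle
    return trigger_gb, freed_gb

# 60 GB L1 with defaults: eviction triggers at 48 GB, frees 12 GB per cycle.
print(eviction_thresholds(60))
```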
Logging:
Use LMCACHE_LOG_LEVEL=DEBUG during initial setup to verify L2 store/load
activity. Switch to INFO (default) for production to reduce log volume.