# Kubernetes Operator

The LMCache Kubernetes operator automates the deployment and lifecycle
management of LMCache multiprocess servers. Instead of hand-writing
DaemonSets, Services, and ConfigMaps (as described in the manual
Deployment Guide), you declare a single LMCacheEngine custom
resource and the operator reconciles all underlying Kubernetes objects.
## Why Use the Operator

The manual DaemonSet approach works, but it has sharp edges the operator eliminates:
- **Auto-injected pod settings** – The operator always sets `hostIPC: true` and `--host 0.0.0.0`. Forgetting `hostIPC` in a hand-written manifest causes silent CUDA IPC failures (`cudaErrorMapBufferObjectFailed`) that are hard to debug.
- **Node-local service discovery** – The operator creates a ClusterIP Service with `internalTrafficPolicy=Local` and a connection ConfigMap that vLLM pods simply mount. No `hostNetwork`, no Downward API, no shell variable substitution.
- **Auto-computed resource sizing** – Memory requests and limits are derived from `l1.sizeGB`, avoiding OOM kills (under-provisioned) or wasted node capacity (over-provisioned).
- **Declarative Prometheus integration** – Set `prometheus.serviceMonitor.enabled: true` and the operator creates a `ServiceMonitor` CR that the Prometheus Operator discovers automatically.
- **CRD validation** – OpenAPI schema validation catches misconfigurations (e.g., `l1.sizeGB <= 0`, invalid port range) at `kubectl apply` time, before any pods are created.
## Prerequisites

- Kubernetes 1.20+
- `kubectl` configured to access your cluster
- (Optional) Prometheus Operator for ServiceMonitor support
## Installing the Operator

**Option A: One-line install from release (recommended)**

```bash
# Latest stable release
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-latest/install.yaml

# Or nightly build from the dev branch
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-nightly-latest/install.yaml
```

**Option B: Build from source**

```bash
cd operator
make build
make install
make deploy IMG=<your-registry>/lmcache-operator:latest
```
## Deploying an LMCacheEngine

A minimal CR deploys a DaemonSet with a 60 GB L1 cache on every GPU node:

```yaml
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60
```

```bash
kubectl apply -f lmcache-engine.yaml
```
The operator automatically:

- Creates a DaemonSet running one LMCache server pod per matched node
- Sets `hostIPC: true` and passes `--host 0.0.0.0` to the server
- Creates a node-local ClusterIP Service for vLLM discovery
- Creates a connection ConfigMap (`my-cache-connection`) with the `kv-transfer-config` JSON that vLLM needs
- Auto-computes resource requests/limits from the L1 cache size
- Defaults `nodeSelector` to `nvidia.com/gpu.present: "true"`
> **Note**
> The operator defaults the container image to `lmcache/vllm-openai:latest`.
> Override with `spec.image.repository` and `spec.image.tag` to pin a
> specific version.
## Connecting vLLM

The operator creates a ConfigMap named `<engine-name>-connection` containing
the `kv-transfer-config` JSON. Mount it in your vLLM Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # Required for CUDA IPC between vLLM and LMCache
      hostIPC: true
      containers:
      - name: vllm
        image: lmcache/vllm-openai:latest
        env:
        # Deterministic hashing required by LMCache
        - name: PYTHONHASHSEED
          value: "0"
        command: ["/bin/sh", "-c"]
        args:
        - |
          exec python3 -m vllm.entrypoints.openai.api_server \
            --model <your-model> \
            --port 8000 \
            --gpu-memory-utilization 0.8 \
            --kv-transfer-config "$(cat /etc/lmcache/kv-transfer-config.json)"
        ports:
        - name: http
          containerPort: 8000
        volumeMounts:
        - name: kv-transfer-config
          mountPath: /etc/lmcache
          readOnly: true
        resources:
          limits:
            nvidia.com/gpu: "1"
      volumes:
      - name: kv-transfer-config
        configMap:
          name: my-cache-connection  # <engine-name>-connection
```
Key requirements for vLLM pods:

- **`hostIPC: true`** – CUDA IPC (`cudaIpcOpenMemHandle`) needs a shared IPC namespace between vLLM and LMCache.
- **`PYTHONHASHSEED=0`** – Ensures deterministic token hashing so vLLM and LMCache produce consistent cache keys.
- **ConfigMap mount** – The `$(cat ...)` pattern reads the connection JSON inline. The ConfigMap name is always `<LMCacheEngine name>-connection`.
- **No `hostNetwork` needed** – The operator's node-local Service handles routing via `internalTrafficPolicy=Local`.
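To see why `PYTHONHASHSEED=0` matters, note that Python's string hash is randomized per process by default, so two pods would otherwise derive different cache keys from the same tokens. A small sketch (illustrative only; LMCache's actual key derivation is internal to the library) that spawns fresh interpreters, since the seed is read at startup:

```python
import os
import subprocess
import sys

def token_hash(seed: str) -> str:
    # Launch a fresh interpreter so PYTHONHASHSEED takes effect
    # (it is only read at process startup).
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('sample-token'))"],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip()

# With PYTHONHASHSEED=0, every process computes the same hash,
# so vLLM and LMCache agree on cache keys.
assert token_hash("0") == token_hash("0")
```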
## Verifying the Deployment

```bash
# Check LMCacheEngine status
kubectl get lmc
```

Expected output:

```
NAME       PHASE     READY   DESIRED   AGE
my-cache   Running   3       3         5m
```

```bash
# Check the connection ConfigMap
kubectl get configmap my-cache-connection -o yaml

# Check LMCache pods
kubectl get pods -l app.kubernetes.io/managed-by=lmcache-operator

# Check detailed status with endpoints
kubectl describe lmc my-cache
```
## CRD Spec Reference

### Image

| Field | Default | Description |
|---|---|---|
| `repository` | `lmcache/vllm-openai` | Container image repository. |
| `tag` | `latest` | Container image tag. |
| – | – | Image pull secret references. |
### Server

| Field | Default | Description |
|---|---|---|
| `port` | `5555` | ZMQ listening port (1024–65535). |
| `chunkSize` | `256` | Token chunk size. |
| `maxWorkers` | – | Worker threads for ZMQ requests. |
### L1 Cache

| Field | Default | Description |
|---|---|---|
| `sizeGB` | required | L1 cache size in GB. Must be > 0. |
### Eviction

| Field | Default | Description |
|---|---|---|
| `triggerWatermark` | – | Usage ratio (0.0–1.0] that triggers eviction. |
| `evictionRatio` | – | Fraction to evict (0.0–1.0]. |
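A quick worked example of how these two knobs interact, under the assumed semantics that eviction starts once usage crosses `triggerWatermark × sizeGB` and frees `evictionRatio × sizeGB` (the exact accounting is internal to the server; this is only a sketch):

```python
def eviction_plan(size_gb: float, trigger_watermark: float, eviction_ratio: float):
    """Sketch of the eviction thresholds implied by the CRD fields.

    Assumes: eviction triggers at trigger_watermark * capacity and
    frees eviction_ratio * capacity worth of cache entries.
    """
    trigger_at_gb = size_gb * trigger_watermark   # usage level that starts eviction
    evict_gb = size_gb * eviction_ratio           # amount freed per eviction pass
    return trigger_at_gb, evict_gb

# The production example below uses 0.8 / 0.2 on a 60 GB cache:
# eviction starts at 48 GB used and frees 12 GB.
assert eviction_plan(60, 0.8, 0.2) == (48.0, 12.0)
```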
### Prometheus

| Field | Default | Description |
|---|---|---|
| `enabled` | – | Expose Prometheus metrics. |
| `serviceMonitor.enabled` | – | Create a ServiceMonitor CR. |
| – | – | Scrape interval. |
| `serviceMonitor.labels` | – | Extra labels on the ServiceMonitor. |
### L2 Storage

| Field | Default | Description |
|---|---|---|
| – | – | List of L2 backends (…). |
### Scheduling

| Field | Default | Description |
|---|---|---|
| `nodeSelector` | GPU nodes | Defaults to `nvidia.com/gpu.present: "true"`. |
| `affinity` | – | Pod affinity rules. |
| `tolerations` | – | Pod tolerations. |
| `priorityClassName` | – | Priority class for pods. |
### Overrides & Extras

| Field | Default | Description |
|---|---|---|
| `resourceOverrides` | – | Override auto-computed resources. |
| – | – | Extra environment variables. |
| – | – | Extra volumes. |
| – | – | Extra volume mounts. |
| `podAnnotations` | – | Extra pod annotations. |
| – | – | Extra pod labels. |
| – | – | ServiceAccount for pods. |
| – | – | Extra CLI flags (appended last, can override). |
## Auto-Computed Resources

When `spec.resourceOverrides` is not set, the operator derives resources from `l1.sizeGB`:

- CPU request: `4` cores
- Memory request: `ceil(l1.sizeGB + 5)` Gi
- Memory limit: `ceil(memoryRequest * 1.5)` Gi

For example, `l1.sizeGB: 60` produces a 65 Gi request and 98 Gi limit.
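The sizing rule above can be sketched in a few lines of Python (illustrative only; the operator implements this internally):

```python
import math

def compute_resources(l1_size_gb: float) -> dict:
    """Derive pod resources from the L1 cache size, mirroring the
    operator's formula: request = ceil(sizeGB + 5) Gi,
    limit = ceil(request * 1.5) Gi."""
    memory_request = math.ceil(l1_size_gb + 5)      # headroom for the server process
    memory_limit = math.ceil(memory_request * 1.5)  # 1.5x safety margin
    return {
        "cpu_request": 4,  # fixed CPU request
        "memory_request_gi": memory_request,
        "memory_limit_gi": memory_limit,
    }

# l1.sizeGB: 60  ->  65 Gi request, 98 Gi limit (matches the example above)
print(compute_resources(60))
```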
## Auto-Injected Pod Settings

The operator always injects these into the pod spec (they are not configurable via the CRD):

- **`hostIPC: true`** – Required for CUDA IPC between LMCache and vLLM.
- **`--host 0.0.0.0`** – Binds the server to all interfaces so the node-local Service can route to it.
- **`NVIDIA_VISIBLE_DEVICES=all`** – Ensures GPU access for IPC-based memory transfers.
- **TCP socket probes** – Startup (5 s initial delay, 30 failures allowed), liveness (10 s), and readiness (5 s) probes on the server port.
> **Note**
> The operator does not mount an emptyDir at `/dev/shm`. With
> `hostIPC: true`, the container sees the host's `/dev/shm` directly.
> Mounting an emptyDir would shadow it with a private tmpfs and break CUDA IPC.
## Resources Created

For an LMCacheEngine named `my-cache`:

| Resource | Name | Purpose |
|---|---|---|
| DaemonSet | – | Runs LMCache server pods. |
| Service (ClusterIP) | `my-cache` | Node-local discovery (`internalTrafficPolicy=Local`). |
| Service (headless) | – | Prometheus scrape target. |
| ConfigMap | `my-cache-connection` | `kv-transfer-config` JSON for vLLM. |
| ServiceMonitor | – | Prometheus Operator integration (when enabled). |
The connection ConfigMap contains:

```json
{
  "kv_connector": "LMCacheMPConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "lmcache.mp.host": "tcp://my-cache.default.svc.cluster.local",
    "lmcache.mp.port": "5555"
  }
}
```
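To see what vLLM ends up with, you can parse the mounted JSON the same way the `$(cat ...)` shell substitution passes it through (a sketch; the endpoint string below is just the host and port fields joined for display):

```python
import json

# The same JSON the operator writes into the <engine-name>-connection ConfigMap.
config_text = """
{
  "kv_connector": "LMCacheMPConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "lmcache.mp.host": "tcp://my-cache.default.svc.cluster.local",
    "lmcache.mp.port": "5555"
  }
}
"""

config = json.loads(config_text)
extra = config["kv_connector_extra_config"]

# vLLM connects back to the node-local Service at this ZMQ endpoint.
endpoint = f'{extra["lmcache.mp.host"]}:{extra["lmcache.mp.port"]}'
print(config["kv_connector"], endpoint)
```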
## Status & Conditions

```bash
kubectl describe lmc my-cache
```

The status section includes:

- `phase`: `Pending`, `Running`, `Degraded`, or `Failed`.
- `readyInstances` / `desiredInstances`: Instance counts.
- `endpoints`: Per-node connection info (node name, host IP, pod name, port, readiness).
- `conditions`:
  - `Available` – At least one instance is ready.
  - `AllInstancesReady` – All desired instances are ready.
  - `ConfigValid` – Spec validation passed.
## Validation Rules

The operator validates the CR spec at apply time:

| Field | Rule |
|---|---|
| `l1.sizeGB` | Required, must be > 0. |
| `eviction.triggerWatermark` | Must be in (0.0, 1.0]. |
| `eviction.evictionRatio` | Must be in (0.0, 1.0]. |
| `server.port` | Must be in [1024, 65535]. |
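The checks above can be expressed as a short validation sketch (the operator enforces them via OpenAPI schema validation; the function below is illustrative, not the operator's code):

```python
def validate_spec(spec: dict) -> list:
    """Mirror the CRD validation rules; returns a list of error strings."""
    errors = []

    size_gb = spec.get("l1", {}).get("sizeGB")
    if size_gb is None or size_gb <= 0:
        errors.append("l1.sizeGB is required and must be > 0")

    eviction = spec.get("eviction", {})
    for field in ("triggerWatermark", "evictionRatio"):
        value = eviction.get(field)
        if value is not None and not (0.0 < value <= 1.0):
            errors.append(f"eviction.{field} must be in (0.0, 1.0]")

    port = spec.get("server", {}).get("port")
    if port is not None and not (1024 <= port <= 65535):
        errors.append("server.port must be in [1024, 65535]")

    return errors

# A valid minimal spec passes; a bad one is rejected before any pods exist.
assert validate_spec({"l1": {"sizeGB": 60}}) == []
assert validate_spec({"l1": {"sizeGB": 0}, "server": {"port": 80}}) != []
```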
## Examples

### Target Only GPU Nodes

Use `nodeSelector` to run LMCache only on GPU nodes. New GPU nodes
automatically get an LMCache pod:

```yaml
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  l1:
    sizeGB: 60
```

> **Note**
> The operator defaults `nodeSelector` to `nvidia.com/gpu.present: "true"`
> when not specified, so a minimal CR already targets GPU nodes.
### Custom Server Port

If the default port (5555) conflicts with other services:

```yaml
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  server:
    port: 6555
  l1:
    sizeGB: 60
```

The connection ConfigMap updates automatically – vLLM pods pick up the new port on restart.
### Production with Prometheus Monitoring

```yaml
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: production-cache
  namespace: llm-serving
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  image:
    repository: lmcache/standalone
    tag: v0.1.0
  server:
    port: 6555
    chunkSize: 256
    maxWorkers: 4
  l1:
    sizeGB: 60
  eviction:
    triggerWatermark: 0.8
    evictionRatio: 0.2
  prometheus:
    enabled: true
    port: 9090
    serviceMonitor:
      enabled: true
      labels:
        release: kube-prometheus-stack
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  priorityClassName: system-node-critical
```

See Observability for metric names and Grafana configuration.
### Override Auto-Computed Resources

```yaml
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60
  resourceOverrides:
    requests:
      memory: "70Gi"
      cpu: "8"
    limits:
      memory: "100Gi"
```
## Operator vs Manual Deployment

| Concern | Manual DaemonSet | LMCacheEngine Operator |
|---|---|---|
| `hostIPC` | Must set manually | Auto-injected |
| `--host 0.0.0.0` | Must set manually | Auto-injected |
| Service discovery | `hostNetwork` + Downward API | Node-local ClusterIP Service + ConfigMap |
| vLLM config | Copy JSON into Deployment | Mount the `<engine-name>-connection` ConfigMap |
| Resource sizing | Manual calculation | Auto-computed from `l1.sizeGB` |
| Prometheus | Manual ServiceMonitor | `serviceMonitor.enabled: true` |
| Validation | Runtime errors only | Schema validation at `kubectl apply` time |
| New GPU nodes | DaemonSet handles it | DaemonSet handles it (same) |
## Security Considerations

`hostIPC` exposes the host's IPC namespace (System V IPC, POSIX message queues) to the container. Any process in the container can interact with IPC resources from other processes on the same host.

- Deploy only in trusted environments.
- Clusters using Pod Security Standards must allow the `privileged` profile for the LMCache namespace – the `baseline` and `restricted` profiles reject `hostIPC`.
## Development

```bash
make generate   # Generate DeepCopy methods
make manifests  # Generate CRD YAML + RBAC
make build      # Compile operator binary
make fmt        # go fmt
make vet        # go vet
make test       # Run unit tests
make lint       # Run golangci-lint
```
Pushing a custom operator image:

```bash
# Docker Hub
make docker-build docker-push IMG=docker.io/<your-user>/lmcache-operator:latest
make deploy IMG=docker.io/<your-user>/lmcache-operator:latest

# Multi-platform (amd64 + arm64)
make docker-buildx IMG=<your-registry>/lmcache-operator:latest
```

If your cluster needs pull credentials:

```bash
kubectl create secret docker-registry regcred \
  --docker-server=<your-registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n lmcache-operator-system
```