Kubernetes Operator#
The LMCache Kubernetes operator automates the deployment and lifecycle
management of LMCache multiprocess servers. Instead of hand-writing
DaemonSets, Services, and ConfigMaps (as described in the manual
Deployment Guide), you declare a single LMCacheEngine custom
resource and the operator reconciles all underlying Kubernetes objects.
Why Use the Operator#
The manual DaemonSet approach works, but it has sharp edges the operator eliminates:
Auto-injected pod settings – The operator always sets hostIPC: true and --host 0.0.0.0. Forgetting hostIPC in a hand-written manifest causes silent CUDA IPC failures (cudaErrorMapBufferObjectFailed) that are hard to debug.
Node-local service discovery – The operator creates a ClusterIP Service with internalTrafficPolicy=Local and a connection ConfigMap that vLLM pods simply mount. No hostNetwork, no Downward API, no shell variable substitution.
Auto-computed resource sizing – Memory requests and limits are derived from l1.sizeGB, avoiding OOM kills (under-provisioned) or wasted node capacity (over-provisioned).
Declarative Prometheus integration – Set prometheus.serviceMonitor.enabled: true and the operator creates a ServiceMonitor CR that the Prometheus Operator discovers automatically.
CRD validation – OpenAPI schema validation catches misconfigurations (e.g., l1.sizeGB <= 0, invalid port range) at kubectl apply time, before any pods are created.
Prerequisites#
Kubernetes 1.20+
kubectl configured to access your cluster
(Optional) Prometheus Operator for ServiceMonitor support
Installing the Operator#
Option A: One-line install from release (recommended)
# Latest stable release
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-latest/install.yaml
# Or nightly build from the dev branch
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-nightly-latest/install.yaml
Option B: Build from source
cd operator
make build
make install
make deploy IMG=<your-registry>/lmcache-operator:latest
Deploying an LMCacheEngine#
A minimal CR deploys a DaemonSet with 60 GB L1 cache on every GPU node:
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60
kubectl apply -f lmcache-engine.yaml
The operator automatically:
Creates a DaemonSet running one LMCache server pod per matched node
Sets hostIPC: true and passes --host 0.0.0.0 to the server
Creates a node-local ClusterIP Service for vLLM discovery
Creates a connection ConfigMap (my-cache-connection) with the kv-transfer-config JSON that vLLM needs
Auto-computes resource requests/limits from the L1 cache size
Defaults nodeSelector to nvidia.com/gpu.present: "true"
Note
The operator defaults the container image to lmcache/vllm-openai:latest.
Override with spec.image.repository and spec.image.tag to pin a
specific version.
Connecting vLLM#
The operator creates a ConfigMap named <engine-name>-connection containing
the kv-transfer-config JSON. Mount it in your vLLM Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # Required for CUDA IPC between vLLM and LMCache
      hostIPC: true
      containers:
        - name: vllm
          image: lmcache/vllm-openai:latest
          env:
            # Deterministic hashing required by LMCache
            - name: PYTHONHASHSEED
              value: "0"
          command: ["/bin/sh", "-c"]
          args:
            - |
              exec python3 -m vllm.entrypoints.openai.api_server \
                --model <your-model> \
                --port 8000 \
                --gpu-memory-utilization 0.8 \
                --kv-transfer-config "$(cat /etc/lmcache/kv-transfer-config.json)"
          ports:
            - name: http
              containerPort: 8000
          volumeMounts:
            - name: kv-transfer-config
              mountPath: /etc/lmcache
              readOnly: true
          resources:
            limits:
              nvidia.com/gpu: "1"
      volumes:
        - name: kv-transfer-config
          configMap:
            name: my-cache-connection  # <engine-name>-connection
Key requirements for vLLM pods:
hostIPC: true – CUDA IPC (cudaIpcOpenMemHandle) needs a shared IPC namespace between vLLM and LMCache.
PYTHONHASHSEED=0 – Ensures deterministic token hashing so vLLM and LMCache produce consistent cache keys.
ConfigMap mount – The $(cat ...) pattern reads the connection JSON inline. The ConfigMap name is always <LMCacheEngine name>-connection.
No hostNetwork needed – The operator’s node-local Service handles routing via internalTrafficPolicy=Local.
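The requirements above can be checked mechanically before applying a manifest. The following is a minimal sketch, not part of the operator: `check_vllm_pod_template` is a hypothetical helper that inspects a Deployment already loaded into a Python dict (for example from `kubectl get deployment vllm -o json`). It only confirms that a volume references the connection ConfigMap; it does not verify the volumeMount or the `--kv-transfer-config` argument.

```python
def check_vllm_pod_template(deployment, engine_name):
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []
    pod_spec = deployment["spec"]["template"]["spec"]

    # hostIPC: true is needed for CUDA IPC between vLLM and LMCache.
    if pod_spec.get("hostIPC") is not True:
        problems.append("hostIPC must be true")

    # PYTHONHASHSEED must be the string "0" in some container's env.
    has_hash_seed = any(
        env.get("name") == "PYTHONHASHSEED" and env.get("value") == "0"
        for container in pod_spec.get("containers", [])
        for env in container.get("env", [])
    )
    if not has_hash_seed:
        problems.append('PYTHONHASHSEED must be set to "0"')

    # The ConfigMap name is always <LMCacheEngine name>-connection.
    expected = f"{engine_name}-connection"
    referenced = {
        volume.get("configMap", {}).get("name")
        for volume in pod_spec.get("volumes", [])
    }
    if expected not in referenced:
        problems.append(f"no volume references ConfigMap {expected}")

    return problems


example = {"spec": {"template": {"spec": {
    "hostIPC": True,
    "containers": [
        {"name": "vllm", "env": [{"name": "PYTHONHASHSEED", "value": "0"}]}
    ],
    "volumes": [
        {"name": "kv-transfer-config",
         "configMap": {"name": "my-cache-connection"}}
    ],
}}}}
print(check_vllm_pod_template(example, "my-cache"))
```

An empty list means the pod template satisfies the three checks; each string in a non-empty list names one missing requirement.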
Verifying the Deployment#
# Check LMCacheEngine status
kubectl get lmc
Expected output:
NAME PHASE READY DESIRED AGE
my-cache Running 3 3 5m
# Check the connection ConfigMap
kubectl get configmap my-cache-connection -o yaml
# Check LMCache pods
kubectl get pods -l app.kubernetes.io/managed-by=lmcache-operator
# Check detailed status with endpoints
kubectl describe lmc my-cache
CRD Spec Reference#
Image#
| Field | Default | Description |
|---|---|---|
| image.repository | lmcache/vllm-openai | Container image repository. |
| image.tag | latest | Container image tag. |
|  |  |  |
|  | – | Image pull secret references. |
Server#
| Field | Default | Description |
|---|---|---|
| server.port | 5555 | ZMQ listening port (1024–65535). |
| server.chunkSize |  | Token chunk size. |
| server.maxWorkers |  | Worker threads for ZMQ requests. |
|  |  |  |
L1 Cache#
| Field | Default | Description |
|---|---|---|
| l1.sizeGB | required | L1 cache size in GB. Must be > 0. |
Eviction#
| Field | Default | Description |
|---|---|---|
|  |  | Only |
| eviction.triggerWatermark |  | Usage ratio in (0.0, 1.0] that triggers eviction. |
| eviction.evictionRatio |  | Fraction to evict, in (0.0, 1.0]. |
Prometheus#
| Field | Default | Description |
|---|---|---|
| prometheus.enabled |  | Expose Prometheus metrics. |
|  |  |  |
| prometheus.serviceMonitor.enabled |  | Create a ServiceMonitor CR. |
|  |  | Scrape interval. |
| prometheus.serviceMonitor.labels | – | Extra labels on the ServiceMonitor. |
L2 Storage#
| Field | Default | Description |
|---|---|---|
|  | – | List of L2 backends ( |
Scheduling#
| Field | Default | Description |
|---|---|---|
| nodeSelector | GPU nodes | Defaults to nvidia.com/gpu.present: "true". |
|  | – | Pod affinity rules. |
|  | – | Pod tolerations. |
| priorityClassName | – | Priority class for pods. |
Overrides & Extras#
| Field | Default | Description |
|---|---|---|
|  |  |  |
| resourceOverrides | – | Override auto-computed resources. |
|  | – | Extra environment variables. |
|  | – | Extra volumes. |
|  | – | Extra volume mounts. |
| podAnnotations | – | Extra pod annotations. |
|  | – | Extra pod labels. |
|  | – | ServiceAccount for pods. |
|  | – | Extra CLI flags (appended last, can override). |
Auto-Computed Resources#
When spec.resourceOverrides is not set, the operator derives resources from
l1.sizeGB:
CPU request: 4 cores
Memory request: ceil(l1.sizeGB + 5) Gi
Memory limit: ceil(memoryRequest * 1.5) Gi
For example, l1.sizeGB: 60 produces a 65 Gi request and 98 Gi limit.
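The sizing rule can be reproduced offline when planning node capacity. This is a minimal sketch that applies only the formulas stated above; `auto_resources` is a hypothetical helper name, not an operator API, and the returned keys are illustrative.

```python
import math


def auto_resources(l1_size_gb):
    """Apply the documented sizing formulas to an l1.sizeGB value."""
    if l1_size_gb <= 0:
        raise ValueError("l1.sizeGB must be > 0")
    memory_request_gi = math.ceil(l1_size_gb + 5)
    memory_limit_gi = math.ceil(memory_request_gi * 1.5)
    return {
        "cpu_request": "4",  # fixed at 4 cores
        "memory_request": f"{memory_request_gi}Gi",
        "memory_limit": f"{memory_limit_gi}Gi",
    }


print(auto_resources(60))
```

For `l1.sizeGB: 60` this yields the 65 Gi request and 98 Gi limit quoted above.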
Auto-Injected Pod Settings#
The operator always injects these into the pod spec (they are not configurable via the CRD):
hostIPC: true – Required for CUDA IPC between LMCache and vLLM.
--host 0.0.0.0 – Binds the server to all interfaces so the node-local Service can route to it.
NVIDIA_VISIBLE_DEVICES=all – Ensures GPU access for IPC-based memory transfers.
TCP socket probes – Startup (5s initial, 30 failures), liveness (10s), and readiness (5s) probes on the server port.
Note
The operator does not mount an emptyDir at /dev/shm. With
hostIPC: true, the container sees the host’s /dev/shm directly.
Mounting an emptyDir would shadow it with a private tmpfs and break CUDA IPC.
Resources Created#
For an LMCacheEngine named my-cache:
| Resource | Name | Purpose |
|---|---|---|
| DaemonSet |  | Runs LMCache server pods. |
| Service (ClusterIP) | my-cache | Node-local discovery (internalTrafficPolicy=Local). |
| Service (headless) |  | Prometheus scrape target. |
| ConfigMap | my-cache-connection |  |
| ServiceMonitor |  | Prometheus Operator integration (when enabled). |
The connection ConfigMap contains:
{
  "kv_connector": "LMCacheMPConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "lmcache.mp.host": "tcp://my-cache.default.svc.cluster.local",
    "lmcache.mp.port": "5555"
  }
}
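To see how the JSON above relates to the engine name, namespace, and port, here is a minimal sketch that rebuilds it. The host pattern `tcp://<name>.<namespace>.svc.cluster.local` is inferred from the single example shown, and `connection_config` is a hypothetical helper, so treat the output as illustrative and compare it against the ConfigMap the operator actually generates.

```python
import json


def connection_config(engine_name, namespace="default", port=5555):
    """Rebuild the kv-transfer-config shown above (host pattern is inferred)."""
    return {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "lmcache.mp.host": f"tcp://{engine_name}.{namespace}.svc.cluster.local",
            # The example stores the port as a string, so do the same here.
            "lmcache.mp.port": str(port),
        },
    }


print(json.dumps(connection_config("my-cache"), indent=2))
```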
Status & Conditions#
kubectl describe lmc my-cache
The status section includes:
phase: Pending, Running, Degraded, or Failed.
readyInstances / desiredInstances: Instance counts.
endpoints: Per-node connection info (node name, host IP, pod name, port, readiness).
conditions:
  Available – At least one instance is ready.
  AllInstancesReady – All desired instances are ready.
  ConfigValid – Spec validation passed.
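The two readiness conditions follow directly from the instance counts. The sketch below illustrates that relationship; `derive_conditions` is a hypothetical helper, and treating a desired count of zero as "not all ready" is an assumption, since the text does not cover that case. The phase is not derived here because its mapping from counts is not specified.

```python
def derive_conditions(ready_instances, desired_instances):
    """Map readyInstances / desiredInstances to the readiness conditions."""
    return {
        # At least one instance is ready.
        "Available": ready_instances >= 1,
        # All desired instances are ready (assumption: false when none are desired).
        "AllInstancesReady": desired_instances > 0
        and ready_instances >= desired_instances,
    }


print(derive_conditions(3, 3))
print(derive_conditions(1, 3))
```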
Validation Rules#
The operator validates the CR spec at apply time:
| Field | Rule |
|---|---|
| l1.sizeGB | Required, must be > 0. |
|  | Must be |
| eviction.triggerWatermark | Must be in (0.0, 1.0]. |
| eviction.evictionRatio | Must be in (0.0, 1.0]. |
| server.port | Must be in [1024, 65535]. |
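The same rules can be run as a pre-flight check, for example in CI before `kubectl apply`. This is a sketch only: the operator enforces them through the CRD's OpenAPI schema, `validate_engine_spec` is a hypothetical helper, and the rule whose text is cut off in the table is not modeled. The eviction field names are taken from the production example in this document.

```python
def validate_engine_spec(spec):
    """Return a list of rule violations for an LMCacheEngine spec dict."""
    errors = []

    size_gb = spec.get("l1", {}).get("sizeGB")
    if size_gb is None or size_gb <= 0:
        errors.append("l1.sizeGB is required and must be > 0")

    eviction = spec.get("eviction", {})
    for field in ("triggerWatermark", "evictionRatio"):
        value = eviction.get(field)
        # Optional fields: only checked when present. Range is (0.0, 1.0].
        if value is not None and not (0.0 < value <= 1.0):
            errors.append(f"eviction.{field} must be in (0.0, 1.0]")

    port = spec.get("server", {}).get("port")
    if port is not None and not (1024 <= port <= 65535):
        errors.append("server.port must be in [1024, 65535]")

    return errors


print(validate_engine_spec({"l1": {"sizeGB": 60}, "server": {"port": 6555}}))
print(validate_engine_spec({"l1": {"sizeGB": 0}, "server": {"port": 80}}))
```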
Examples#
Target Only GPU Nodes#
Use nodeSelector to run LMCache only on GPU nodes. New GPU nodes
automatically get an LMCache pod:
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  l1:
    sizeGB: 60
Note
The operator defaults nodeSelector to nvidia.com/gpu.present: "true"
when not specified, so a minimal CR already targets GPU nodes.
Custom Server Port#
If the default port (5555) conflicts with other services:
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  server:
    port: 6555
  l1:
    sizeGB: 60
The connection ConfigMap updates automatically – vLLM pods pick up the new port on restart.
Production with Prometheus Monitoring#
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: production-cache
  namespace: llm-serving
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  image:
    repository: lmcache/standalone
    tag: v0.1.0
  server:
    port: 6555
    chunkSize: 256
    maxWorkers: 4
  l1:
    sizeGB: 60
  eviction:
    triggerWatermark: 0.8
    evictionRatio: 0.2
  prometheus:
    enabled: true
    port: 9090
    serviceMonitor:
      enabled: true
      labels:
        release: kube-prometheus-stack
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  priorityClassName: system-node-critical
See Observability for metric names and Grafana configuration.
Override Auto-Computed Resources#
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60
  resourceOverrides:
    requests:
      memory: "70Gi"
      cpu: "8"
    limits:
      memory: "100Gi"
CacheBlend#
CacheBlend reuses cached KV at shifted (non-prefix) positions by recomputing a
small subset of tokens. The operator manages it as a second CRD,
CacheBlendEngine, plus a mutating admission webhook that injects the
pure-Python lmcache-cacheblend vLLM plugin into your serving pods – so you
do not rebuild the vLLM image. See Blending
for the technique itself.
It has two halves the operator runs together:
a GPU-resident CacheBlend V3 engine (lmcache server --engine-type blend), deployed as a DaemonSet with the same GPU model as LMCacheEngine (privileged + runtimeClassName: nvidia + NVIDIA_VISIBLE_DEVICES=all + hostIPC, and no nvidia.com/gpu claim) so it shares the vLLM GPU for same-device CUDA IPC; and
the vLLM-side plugin, injected into opted-in pods by the webhook.
Additional Prerequisites#
Beyond the operator prerequisites above:
cert-manager – the webhook’s serving certificate is issued by a cert-manager Issuer + Certificate. Install it before make deploy:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
kubectl -n cert-manager wait --for=condition=Available deploy --all --timeout=180s
Deploy with the webhook – use make deploy (not make run, which is controller-only and disables the webhook via ENABLE_WEBHOOKS=false).
Pod Security Standards – the webhook injects hostIPC/privileged, which the baseline/restricted profiles reject, so label the engine’s and the vLLM pod’s namespaces pod-security.kubernetes.io/enforce=privileged.
Deploying a CacheBlendEngine#
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: CacheBlendEngine
metadata:
  name: my-cacheblend
spec:
  l1:
    sizeGB: 60
  injection:
    # The (private) cacheblend-plugin init-container image -- repository/tag/
    # pullPolicy, like spec.image. Set repository to YOUR image; the
    # inherited engine-image default is not a valid payload.
    payloadImage:
      repository: <registry>/cacheblend-plugin
      tag: <tag>
    # Appended to the vLLM pod so the private payload image can pull; the
    # Secret must exist in the vLLM pod's namespace.
    imagePullSecrets:
      - name: my-registry-secret
The engine runs lmcache server --engine-type blend as a DaemonSet and
emits a my-cacheblend-connection ConfigMap with the CBKVConnector
kv-transfer-config (the operator wires the node-local Service host/port and
the cb.* tunables).
Opting a vLLM Pod In#
Label the pod template for the webhook and bind it to an engine by name. Launch
vLLM via the image ENTRYPOINT (args only) – a
command: ["/bin/sh", "-c", ...] wrapper is skipped, since appended args would
not reach vllm serve:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-cacheblend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-cacheblend
  template:
    metadata:
      labels:
        app: vllm-cacheblend
        lmcache.ai/cacheblend-inject: "true"  # opt-in (webhook objectSelector)
      annotations:
        lmcache.ai/cacheblend-engine: "my-cacheblend"  # bind to the engine
    spec:
      runtimeClassName: nvidia
      containers:
        - name: vllm
          image: lmcache/vllm-openai:<pinned-tag>
          args: ["<your-model>", "--port", "8000", "--gpu-memory-utilization", "0.8"]
          resources:
            limits:
              nvidia.com/gpu: "1"
The webhook injects the plugin init container, PYTHONPATH, hostIPC, the
private-image pull secret, and the required CacheBlend vLLM flags
(--attention-backend CUSTOM, --kv-transfer-config from the engine’s
connection ConfigMap, --block-size 64, --pipeline-parallel-size 1,
--no-enable-chunked-prefill, --no-async-scheduling, --enforce-eager).
You supply only the model and your non-CacheBlend flags.
Verifying Injection#
The webhook mutates Pods, not the Deployment, so inspect a pod:
kubectl get pod -l app=vllm-cacheblend -o yaml | \
grep -E "initContainers|cb-plugin|PYTHONPATH|attention-backend|cacheblend-injected|skip-reason"
If nothing was injected, check the pod’s lmcache.ai/cacheblend-skip-reason
annotation: command-override (a sh -c wrapper was used),
kv-transfer-config-present (you set your own), engine-not-found (the
<name>-connection ConfigMap is missing), payload-image-unset (the
engine’s injection.payloadImage has no repository), or
target-container-not-found (the requested targetContainer /
cacheblend-container annotation names a container the pod does not have).
With failurePolicy: Ignore a
webhook/cert problem also leaves the pod un-mutated silently – confirm the
operator pod is Running and the MutatingWebhookConfiguration exists.
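When several pods are affected, it can help to translate the skip-reason annotation into the explanations listed above. The following is a small sketch that works on a pod object already parsed from `kubectl get pod <pod> -o json`; `explain_skip` and the message wording are hypothetical, while the annotation key and reason values come from the text above.

```python
SKIP_REASON_ANNOTATION = "lmcache.ai/cacheblend-skip-reason"

SKIP_REASONS = {
    "command-override": "a sh -c command wrapper was used",
    "kv-transfer-config-present": "the pod sets its own kv-transfer-config",
    "engine-not-found": "the <name>-connection ConfigMap is missing",
    "payload-image-unset": "the engine's injection.payloadImage has no repository",
    "target-container-not-found": "the requested container is not in the pod",
}


def explain_skip(pod):
    """Return a human-readable explanation for an un-mutated pod."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    reason = annotations.get(SKIP_REASON_ANNOTATION)
    if reason is None:
        # With failurePolicy: Ignore, a webhook or certificate problem
        # leaves no annotation at all.
        return "no skip-reason annotation; check the operator pod and webhook"
    return SKIP_REASONS.get(reason, f"unrecognized skip reason: {reason}")


pod = {"metadata": {"annotations": {SKIP_REASON_ANNOTATION: "engine-not-found"}}}
print(explain_skip(pod))
```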
CacheBlendEngine Fields#
CacheBlendEngineSpec mirrors LMCacheEngineSpec (every field in the CRD
Spec Reference above) and adds:
| Field | Default | Description |
|---|---|---|
|  |  | Layer at which token importance is scored ( |
|  |  | Fraction of non-prefix-hit tokens recomputed ( |
| injection.payloadImage | required | The (private) cacheblend-plugin init-container image ( |
| injection.imagePullSecrets | – | Pull secrets appended to the vLLM pod for the private payload image. |
|  | first container | Name of the vLLM container to inject into. |
|  |  |  |
server.chunkSize defaults to 256 and must equal 256 (the blend matcher
requires chunk_size == vLLM --block-size * 4).
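The fixed value of 256 follows from the `--block-size 64` flag the webhook injects. A tiny sketch of that arithmetic, with hypothetical helper names:

```python
def required_chunk_size(block_size):
    """chunk_size the blend matcher requires for a given vLLM --block-size."""
    return block_size * 4


def blend_sizes_match(chunk_size, block_size):
    """True when server.chunkSize satisfies chunk_size == block_size * 4."""
    return chunk_size == required_chunk_size(block_size)


print(required_chunk_size(64))
print(blend_sizes_match(256, 64))
```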
LMCacheCoordinator#
The LMCacheCoordinator CRD runs the mp coordinator – a fleet-wide HTTP
service that tracks mp server instances, evicts those whose heartbeats lapse,
performs L2 quota eviction, and hosts the global CacheBlend fingerprint
directory. It is a plain (non-GPU) Deployment exposed through a ClusterIP
Service; engines reach it via coordinator.ref or coordinator.url.
Deploying a Coordinator#
A ready-to-edit manifest lives at
config/samples/lmcache_v1alpha1_lmcachecoordinator.yaml in the operator
repo. A minimal coordinator:
apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheCoordinator
metadata:
  name: my-coordinator
spec:
  port: 9300
kubectl get lmcc my-coordinator # shortName: lmcc
Connecting an Engine#
Point an LMCacheEngine / CacheBlendEngine at the coordinator through its
coordinator block. Use ref to name a coordinator in the same namespace
(the operator resolves it to the in-cluster Service URL), or url for an
explicit endpoint:
spec:
  coordinator:
    ref:
      name: my-coordinator     # or: url: http://my-coordinator.default.svc:9300
    heartbeatInterval: 5       # seconds; must be > 0
    l2EventReporting: false    # report L2 store/lookup events for fleet eviction
Coordinator CRD Spec Reference#
Topology#
| Field | Default | Description |
|---|---|---|
|  |  | Coordinator pods. The registry is per-process in-memory, so >1 only makes sense behind a shared durable backend. Must be >= 0. |
|  | shared engine image | Runs the same lmcache binary as the engines. |
|  | – | Image pull secret references. |
HTTP Server#
| Field | Default | Description |
|---|---|---|
|  |  | Address the coordinator’s HTTP server binds to. |
| port |  | HTTP port (1–65535). |
Membership & Health#
| Field | Default | Description |
|---|---|---|
|  |  | Seconds without a heartbeat after which an instance is evicted. Set comfortably above the engines’ |
|  |  | Seconds between health-check sweeps; |
L2 Quota Eviction#
| Field | Default | Description |
|---|---|---|
|  |  | Seconds between L2 eviction sweeps; |
|  |  | Fraction of tracked keys (by count) to evict per cycle, [0.0, 1.0]. |
|  |  | Usage fraction of the quota that fires eviction, (0.0, 1.0]. |
Global CacheBlend Directory#
| Field | Default | Description |
|---|---|---|
|  |  | Tokens per chunk for the global CacheBlend directory (the match unit). Must equal the LMCache chunk size the blend servers use. Must be > 0. |
|  |  | Positions between match probes. |
Prometheus, Scheduling & Overrides#
| Field | Default | Description |
|---|---|---|
|  |  | Expose the metrics container port. See the note below. |
|  |  | Metrics port. |
|  |  | Create a ServiceMonitor CR (and headless metrics Service). |
|  |  | Scrape interval. |
|  |  |  |
|  | – | Pod resource requests/limits (no auto-compute; the coordinator is CPU/memory light). |
|  | – | Pod scheduling controls. |
|  | – | Standard pod-shaping fields. |
|  | – | Extra CLI flags (appended last, can override any auto-generated flag). |
Note
The coordinator process does not yet expose a /metrics endpoint. The
Prometheus wiring is present for parity but is only useful once metrics are
added; serviceMonitor.enabled defaults to false.
Coordinator Resources Created#
For an LMCacheCoordinator named my-coordinator:
| Resource | Name | Purpose |
|---|---|---|
| Deployment |  | Runs the coordinator HTTP server pods. |
| Service (ClusterIP) | my-coordinator | Fleet-wide discovery on the HTTP port. |
| Service (headless) |  | Prometheus scrape target (when |
| ServiceMonitor |  | Prometheus Operator integration (when |
The status endpoint other components use to reach the coordinator is
http://<name>.<namespace>.svc:<port> (e.g.
http://my-coordinator.default.svc:9300).
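The endpoint pattern can be composed programmatically, for instance when templating a `coordinator.url` value. This is a minimal sketch; `coordinator_endpoint` is a hypothetical helper, and the default port of 9300 is borrowed from the sample manifest rather than a confirmed CRD default.

```python
def coordinator_endpoint(name, namespace="default", port=9300):
    """Compose http://<name>.<namespace>.svc:<port> for a coordinator."""
    return f"http://{name}.{namespace}.svc:{port}"


print(coordinator_endpoint("my-coordinator"))
```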
Coordinator Status & Conditions#
The status section includes:
phase: Pending, Running, Degraded, or Failed.
replicas / readyReplicas: Pod counts from the Deployment.
endpoint: In-cluster URL for reaching the coordinator.
observedGeneration: Most recent reconciled generation.
conditions:
  Available – At least one replica is ready.
  AllInstancesReady – All desired replicas are ready.
  ConfigValid – Spec validation passed.
Coordinator Validation Rules#
| Field | Rule |
|---|---|
| port | Must be in [1, 65535]. |
|  | Must be >= 0. |
|  | Must be > 0. |
|  | Must be >= 0. |
|  | Must be in [0.0, 1.0]. |
|  | Must be in (0.0, 1.0]. |
|  | Must be > 0. |
Operator vs Manual Deployment#
| Concern | Manual DaemonSet | LMCacheEngine Operator |
|---|---|---|
| hostIPC | Must set manually | Auto-injected |
| --host 0.0.0.0 | Must set manually | Auto-injected |
| Service discovery |  | Node-local ClusterIP Service + ConfigMap |
| vLLM config | Copy JSON into Deployment | Mount the <engine-name>-connection ConfigMap |
| Resource sizing | Manual calculation | Auto-computed from l1.sizeGB |
| Prometheus | Manual ServiceMonitor |  |
| Validation | Runtime errors only |  |
| New GPU nodes | DaemonSet handles it | DaemonSet handles it (same) |
Security Considerations#
hostIPC exposes the host’s IPC namespace (System V IPC, POSIX message queues) to the container. Any process in the container can interact with IPC resources from other processes on the same host.
Deploy only in trusted environments.
Clusters using Pod Security Standards must allow the privileged profile for the LMCache namespace – the baseline and restricted profiles reject hostIPC.
Development#
make generate # Generate DeepCopy methods
make manifests # Generate CRD YAML + RBAC
make build # Compile operator binary
make fmt # go fmt
make vet # go vet
make test # Run unit tests
make lint # Run golangci-lint
Pushing a custom operator image:
# Docker Hub
make docker-build docker-push IMG=docker.io/<your-user>/lmcache-operator:latest
make deploy IMG=docker.io/<your-user>/lmcache-operator:latest
# Multi-platform (amd64 + arm64)
make docker-buildx IMG=<your-registry>/lmcache-operator:latest
If your cluster needs pull credentials:
kubectl create secret docker-registry regcred \
--docker-server=<your-registry> \
--docker-username=<username> \
--docker-password=<password> \
-n lmcache-operator-system