Kubernetes Operator#

The LMCache Kubernetes operator automates the deployment and lifecycle management of LMCache multiprocess servers. Instead of hand-writing DaemonSets, Services, and ConfigMaps (as described in the manual Deployment Guide), you declare a single LMCacheEngine custom resource and the operator reconciles all underlying Kubernetes objects.

Why Use the Operator#

The manual DaemonSet approach works, but it has sharp edges the operator eliminates:

  • Auto-injected pod settings – The operator always sets hostIPC: true and --host 0.0.0.0. Forgetting hostIPC in a hand-written manifest causes silent CUDA IPC failures (cudaErrorMapBufferObjectFailed) that are hard to debug.

  • Node-local service discovery – The operator creates a ClusterIP Service with internalTrafficPolicy=Local and a connection ConfigMap that vLLM pods simply mount. No hostNetwork, no Downward API, no shell variable substitution.

  • Auto-computed resource sizing – Memory requests and limits are derived from l1.sizeGB, avoiding OOM kills (under-provisioned) or wasted node capacity (over-provisioned).

  • Declarative Prometheus integration – Set prometheus.serviceMonitor.enabled: true and the operator creates a ServiceMonitor CR that the Prometheus Operator discovers automatically.

  • CRD validation – OpenAPI schema validation catches misconfigurations (e.g., l1.sizeGB <= 0, invalid port range) at kubectl apply time, before any pods are created.

Prerequisites#

  • Kubernetes 1.20+

  • kubectl configured to access your cluster

  • (Optional) Prometheus Operator for ServiceMonitor support

Installing the Operator#

Option A: One-line install from release (recommended)

# Latest stable release
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-latest/install.yaml

# Or nightly build from the dev branch
kubectl apply -f https://github.com/LMCache/LMCache/releases/download/operator-nightly-latest/install.yaml

Option B: Build from source

cd operator
make build
make install
make deploy IMG=<your-registry>/lmcache-operator:latest

Deploying an LMCacheEngine#

A minimal CR deploys a DaemonSet with 60 GB L1 cache on every GPU node:

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60

Save the manifest as lmcache-engine.yaml and apply it:

kubectl apply -f lmcache-engine.yaml

The operator automatically:

  • Creates a DaemonSet running one LMCache server pod per matched node

  • Sets hostIPC: true and passes --host 0.0.0.0 to the server

  • Creates a node-local ClusterIP Service for vLLM discovery

  • Creates a connection ConfigMap (my-cache-connection) with the kv-transfer-config JSON that vLLM needs

  • Auto-computes resource requests/limits from the L1 cache size

  • Defaults nodeSelector to nvidia.com/gpu.present: "true"

Note

The operator defaults the container image to lmcache/vllm-openai:latest. Override with spec.image.repository and spec.image.tag to pin a specific version.

Connecting vLLM#

The operator creates a ConfigMap named <engine-name>-connection containing the kv-transfer-config JSON. Mount it in your vLLM Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # Required for CUDA IPC between vLLM and LMCache
      hostIPC: true
      containers:
        - name: vllm
          image: lmcache/vllm-openai:latest
          env:
            # Deterministic hashing required by LMCache
            - name: PYTHONHASHSEED
              value: "0"
          command: ["/bin/sh", "-c"]
          args:
            - |
              exec python3 -m vllm.entrypoints.openai.api_server \
                --model <your-model> \
                --port 8000 \
                --gpu-memory-utilization 0.8 \
                --kv-transfer-config "$(cat /etc/lmcache/kv-transfer-config.json)"
          ports:
            - name: http
              containerPort: 8000
          volumeMounts:
            - name: kv-transfer-config
              mountPath: /etc/lmcache
              readOnly: true
          resources:
            limits:
              nvidia.com/gpu: "1"
      volumes:
        - name: kv-transfer-config
          configMap:
            name: my-cache-connection  # <engine-name>-connection

Key requirements for vLLM pods:

  • hostIPC: true – CUDA IPC (cudaIpcOpenMemHandle) needs a shared IPC namespace between vLLM and LMCache.

  • PYTHONHASHSEED=0 – Ensures deterministic token hashing so vLLM and LMCache produce consistent cache keys.

  • ConfigMap mount – The $(cat ...) pattern reads the connection JSON inline. The ConfigMap name is always <LMCacheEngine name>-connection.

  • No hostNetwork needed – The operator’s node-local Service handles routing via internalTrafficPolicy=Local.

Verifying the Deployment#

# Check LMCacheEngine status
kubectl get lmc

Expected output:

NAME       PHASE     READY   DESIRED   AGE
my-cache   Running   3       3         5m

# Check the connection ConfigMap
kubectl get configmap my-cache-connection -o yaml

# Check LMCache pods
kubectl get pods -l app.kubernetes.io/managed-by=lmcache-operator

# Check detailed status with endpoints
kubectl describe lmc my-cache

CRD Spec Reference#

Image#

| Field | Default | Description |
|---|---|---|
| image.repository | lmcache/vllm-openai | Container image repository. |
| image.tag | latest | Container image tag. |
| image.pullPolicy | IfNotPresent | Always, Never, or IfNotPresent. |
| imagePullSecrets | | Image pull secret references. |

Server#

| Field | Default | Description |
|---|---|---|
| server.port | 5555 | ZMQ listening port (1024–65535). |
| server.chunkSize | 256 | Token chunk size. |
| server.maxWorkers | 1 | Worker threads for ZMQ requests. |
| server.hashAlgorithm | blake3 | builtin, sha256_cbor, or blake3. |

L1 Cache#

| Field | Default | Description |
|---|---|---|
| l1.sizeGB | required | L1 cache size in GB. Must be > 0. |

Eviction#

| Field | Default | Description |
|---|---|---|
| eviction.policy | LRU | Only LRU is supported. |
| eviction.triggerWatermark | 0.8 | Usage ratio (0.0–1.0] to trigger eviction. |
| eviction.evictionRatio | 0.2 | Fraction to evict (0.0–1.0]. |

Prometheus#

| Field | Default | Description |
|---|---|---|
| prometheus.enabled | true | Expose Prometheus metrics. |
| prometheus.port | 9090 | /metrics endpoint port. |
| prometheus.serviceMonitor.enabled | false | Create a ServiceMonitor CR. |
| prometheus.serviceMonitor.interval | 30s | Scrape interval. |
| prometheus.serviceMonitor.labels | | Extra labels on the ServiceMonitor. |

L2 Storage#

| Field | Default | Description |
|---|---|---|
| l2Backends | | List of L2 backends (type + config). See Secondary KV Storage. |

Scheduling#

| Field | Default | Description |
|---|---|---|
| nodeSelector | GPU nodes | Defaults to nvidia.com/gpu.present: "true". |
| affinity | | Pod affinity rules. |
| tolerations | | Pod tolerations. |
| priorityClassName | | Priority class for pods. |

Overrides & Extras#

| Field | Default | Description |
|---|---|---|
| logLevel | INFO | DEBUG, INFO, WARNING, ERROR. |
| resourceOverrides | | Override auto-computed resources. |
| env | | Extra environment variables. |
| volumes | | Extra volumes. |
| volumeMounts | | Extra volume mounts. |
| podAnnotations | | Extra pod annotations. |
| podLabels | | Extra pod labels. |
| serviceAccountName | | ServiceAccount for pods. |
| extraArgs | | Extra CLI flags (appended last, can override). |

Auto-Computed Resources#

When spec.resourceOverrides is not set, the operator derives resources from l1.sizeGB:

  • CPU request: 4 cores

  • Memory request: ceil(l1.sizeGB + 5) Gi

  • Memory limit: ceil(memoryRequest * 1.5) Gi

For example, l1.sizeGB: 60 produces a 65 Gi request and 98 Gi limit.
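The sizing rule can be reproduced in a few lines of Python, for instance to check whether a node has enough memory before applying a CR. This is a minimal sketch of the documented formula only, not the operator's own code; the helper name `auto_resources` is hypothetical.

```python
import math


def auto_resources(l1_size_gb: float) -> dict:
    """Mirror the documented sizing rule (hypothetical helper, not operator code)."""
    # Memory request: ceil(l1.sizeGB + 5) Gi
    memory_request_gi = math.ceil(l1_size_gb + 5)
    # Memory limit: ceil(memoryRequest * 1.5) Gi
    memory_limit_gi = math.ceil(memory_request_gi * 1.5)
    return {
        "cpu_request_cores": 4,  # fixed CPU request from the list above
        "memory_request_gi": memory_request_gi,
        "memory_limit_gi": memory_limit_gi,
    }


print(auto_resources(60))
# {'cpu_request_cores': 4, 'memory_request_gi': 65, 'memory_limit_gi': 98}
```

The limit is computed from the already-rounded request, which is why 60 GB yields 98 Gi (ceil(65 * 1.5) = ceil(97.5)).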

Auto-Injected Pod Settings#

The operator always injects these into the pod spec (they are not configurable via the CRD):

  • hostIPC: true – Required for CUDA IPC between LMCache and vLLM.

  • --host 0.0.0.0 – Binds the server to all interfaces so the node-local Service can route to it.

  • NVIDIA_VISIBLE_DEVICES=all – Ensures GPU access for IPC-based memory transfers.

  • TCP socket probes – Startup (5s initial, 30 failures), liveness (10s), and readiness (5s) probes on the server port.

Note

The operator does not mount an emptyDir at /dev/shm. With hostIPC: true, the container sees the host’s /dev/shm directly. Mounting an emptyDir would shadow it with a private tmpfs and break CUDA IPC.

Resources Created#

For an LMCacheEngine named my-cache:

| Resource | Name | Purpose |
|---|---|---|
| DaemonSet | my-cache | Runs LMCache server pods. |
| Service (ClusterIP) | my-cache | Node-local discovery (internalTrafficPolicy=Local). |
| Service (headless) | my-cache-metrics | Prometheus scrape target. |
| ConfigMap | my-cache-connection | kv-transfer-config JSON for vLLM. |
| ServiceMonitor | my-cache | Prometheus Operator integration (when enabled). |

The connection ConfigMap contains:

{
  "kv_connector": "LMCacheMPConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "lmcache.mp.host": "tcp://my-cache.default.svc.cluster.local",
    "lmcache.mp.port": "5555"
  }
}
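To see how the fields relate to the engine's name, namespace, and port, the same JSON can be rebuilt in Python. This is an illustrative sketch: the host pattern `tcp://<name>.<namespace>.svc.cluster.local` is inferred from the example above and assumes the default `cluster.local` cluster domain, and the helper name `connection_config` is hypothetical.

```python
import json


def connection_config(engine_name: str, namespace: str = "default", port: int = 5555) -> dict:
    """Rebuild the connection JSON shown above (hypothetical helper, not operator code)."""
    # Assumption: the host is the DNS name of the Service that shares the engine's name.
    host = f"tcp://{engine_name}.{namespace}.svc.cluster.local"
    return {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "lmcache.mp.host": host,
            "lmcache.mp.port": str(port),  # the example stores the port as a string
        },
    }


print(json.dumps(connection_config("my-cache"), indent=2))
```

Comparing this output with `kubectl get configmap my-cache-connection -o yaml` is a quick way to confirm that a vLLM pod is pointed at the engine you expect.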

Status & Conditions#

kubectl describe lmc my-cache

The status section includes:

  • phase: Pending, Running, Degraded, or Failed.

  • readyInstances / desiredInstances: Instance counts.

  • endpoints: Per-node connection info (node name, host IP, pod name, port, readiness).

  • conditions:

    • Available – At least one instance is ready.

    • AllInstancesReady – All desired instances are ready.

    • ConfigValid – Spec validation passed.
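The two readiness conditions follow directly from the instance counts. The sketch below restates them in Python for use in a monitoring script; it is not the operator's reconcile logic, the helper name `engine_conditions` is hypothetical, and treating zero desired instances as "not all ready" is an assumption of this example.

```python
def engine_conditions(ready_instances: int, desired_instances: int) -> dict:
    """Derive the readiness conditions from the status counts (illustrative only)."""
    return {
        # Available: at least one instance is ready.
        "Available": ready_instances >= 1,
        # AllInstancesReady: every desired instance is ready.
        # Assumption: with zero desired instances this reports False.
        "AllInstancesReady": desired_instances > 0 and ready_instances >= desired_instances,
    }


print(engine_conditions(2, 3))
# {'Available': True, 'AllInstancesReady': False}
```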

Validation Rules#

The operator validates the CR spec at apply time:

| Field | Rule |
|---|---|
| l1.sizeGB | Required, must be > 0. |
| eviction.policy | Must be LRU (if set). |
| eviction.triggerWatermark | Must be in (0.0, 1.0]. |
| eviction.evictionRatio | Must be in (0.0, 1.0]. |
| server.port | Must be in [1024, 65535]. |
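The same rules can be checked client-side, for example in CI before a manifest reaches the cluster. The following is a sketch that mirrors the rules listed above on a spec loaded as a Python dict; it is not the operator's validation code, and the function name `validate_engine_spec` is hypothetical.

```python
def validate_engine_spec(spec: dict) -> list[str]:
    """Return a list of rule violations for an LMCacheEngine spec (illustrative only)."""
    errors = []

    # l1.sizeGB: required, must be > 0.
    size = (spec.get("l1") or {}).get("sizeGB")
    if size is None:
        errors.append("l1.sizeGB is required")
    elif size <= 0:
        errors.append("l1.sizeGB must be > 0")

    eviction = spec.get("eviction") or {}
    # eviction.policy: must be LRU if set.
    policy = eviction.get("policy")
    if policy is not None and policy != "LRU":
        errors.append("eviction.policy must be LRU")
    # Both ratios: half-open interval (0.0, 1.0].
    for field in ("triggerWatermark", "evictionRatio"):
        value = eviction.get(field)
        if value is not None and not (0.0 < value <= 1.0):
            errors.append(f"eviction.{field} must be in (0.0, 1.0]")

    # server.port: must be in [1024, 65535] if set.
    port = (spec.get("server") or {}).get("port")
    if port is not None and not (1024 <= port <= 65535):
        errors.append("server.port must be in [1024, 65535]")

    return errors


print(validate_engine_spec({"l1": {"sizeGB": 60}, "server": {"port": 80}}))
# ['server.port must be in [1024, 65535]']
```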

Examples#

Target Only GPU Nodes#

Use nodeSelector to run LMCache only on GPU nodes. New GPU nodes automatically get an LMCache pod:

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  l1:
    sizeGB: 60

Note

The operator defaults nodeSelector to nvidia.com/gpu.present: "true" when not specified, so a minimal CR already targets GPU nodes.

Custom Server Port#

If the default port (5555) conflicts with other services:

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  server:
    port: 6555
  l1:
    sizeGB: 60

The connection ConfigMap updates automatically – vLLM pods pick up the new port on restart.

Production with Prometheus Monitoring#

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: production-cache
  namespace: llm-serving
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  image:
    repository: lmcache/standalone
    tag: v0.1.0
  server:
    port: 6555
    chunkSize: 256
    maxWorkers: 4
  l1:
    sizeGB: 60
  eviction:
    triggerWatermark: 0.8
    evictionRatio: 0.2
  prometheus:
    enabled: true
    port: 9090
    serviceMonitor:
      enabled: true
      labels:
        release: kube-prometheus-stack
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  priorityClassName: system-node-critical

See Observability for metric names and Grafana configuration.

Override Auto-Computed Resources#

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheEngine
metadata:
  name: my-cache
spec:
  l1:
    sizeGB: 60
  resourceOverrides:
    requests:
      memory: "70Gi"
      cpu: "8"
    limits:
      memory: "100Gi"

CacheBlend#

CacheBlend reuses cached KV at shifted (non-prefix) positions by recomputing a small subset of tokens. The operator manages it as a second CRD, CacheBlendEngine, plus a mutating admission webhook that injects the pure-Python lmcache-cacheblend vLLM plugin into your serving pods – so you do not rebuild the vLLM image. See Blending for the technique itself.

It has two halves the operator runs together:

  • a GPU-resident CacheBlend V3 engine (lmcache server --engine-type blend), deployed as a DaemonSet that uses the same GPU access model as LMCacheEngine: privileged, runtimeClassName: nvidia, NVIDIA_VISIBLE_DEVICES=all, and hostIPC, with no nvidia.com/gpu claim, so that it shares the vLLM GPU for same-device CUDA IPC; and

  • the vLLM-side plugin, injected into opted-in pods by the webhook.

Additional Prerequisites#

Beyond the operator prerequisites above:

  • cert-manager – the webhook’s serving certificate is issued by a cert-manager Issuer + Certificate. Install it before make deploy:

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
    kubectl -n cert-manager wait --for=condition=Available deploy --all --timeout=180s
    
  • Deploy with the webhook – use make deploy (not make run, which is controller-only and disables the webhook via ENABLE_WEBHOOKS=false).

  • Pod Security Standards – the webhook injects hostIPC/privileged, which the baseline and restricted profiles reject, so label both the engine’s namespace and the vLLM pod’s namespace with pod-security.kubernetes.io/enforce=privileged.

Deploying a CacheBlendEngine#

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: CacheBlendEngine
metadata:
  name: my-cacheblend
spec:
  l1:
    sizeGB: 60
  injection:
    # The (private) cacheblend-plugin init-container image -- repository/tag/
    # pullPolicy, like spec.image.  Set repository to YOUR image; the
    # inherited engine-image default is not a valid payload.
    payloadImage:
      repository: <registry>/cacheblend-plugin
      tag: <tag>
    # Appended to the vLLM pod so the private payload image can pull; the
    # Secret must exist in the vLLM pod's namespace.
    imagePullSecrets:
      - name: my-registry-secret

The engine runs lmcache server --engine-type blend as a DaemonSet and emits a my-cacheblend-connection ConfigMap with the CBKVConnector kv-transfer-config (the operator wires the node-local Service host/port and the cb.* tunables).

Opting a vLLM Pod In#

Label the pod template so the webhook selects it, and bind it to an engine by name with an annotation. Launch vLLM through the image ENTRYPOINT (args only). The webhook skips pods that use a command: ["/bin/sh", "-c", ...] wrapper, because the flags it appends to args would not reach vllm serve:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-cacheblend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-cacheblend
  template:
    metadata:
      labels:
        app: vllm-cacheblend
        lmcache.ai/cacheblend-inject: "true"          # opt-in (webhook objectSelector)
      annotations:
        lmcache.ai/cacheblend-engine: "my-cacheblend" # bind to the engine
    spec:
      runtimeClassName: nvidia
      containers:
        - name: vllm
          image: lmcache/vllm-openai:<pinned-tag>
          args: ["<your-model>", "--port", "8000", "--gpu-memory-utilization", "0.8"]
          resources:
            limits:
              nvidia.com/gpu: "1"

The webhook injects the plugin init container, PYTHONPATH, hostIPC, the private-image pull secret, and the required CacheBlend vLLM flags (--attention-backend CUSTOM, --kv-transfer-config from the engine’s connection ConfigMap, --block-size 64, --pipeline-parallel-size 1, --no-enable-chunked-prefill, --no-async-scheduling, --enforce-eager). You supply only the model and your non-CacheBlend flags.

Verifying Injection#

The webhook mutates Pods, not the Deployment, so inspect a pod:

kubectl get pod -l app=vllm-cacheblend -o yaml | \
  grep -E "initContainers|cb-plugin|PYTHONPATH|attention-backend|cacheblend-injected|skip-reason"

If nothing was injected, check the pod’s lmcache.ai/cacheblend-skip-reason annotation:

  • command-override – a sh -c wrapper was used.

  • kv-transfer-config-present – you set your own kv-transfer-config.

  • engine-not-found – the <name>-connection ConfigMap is missing.

  • payload-image-unset – the engine’s injection.payloadImage has no repository.

  • target-container-not-found – the requested targetContainer / cacheblend-container annotation names a container the pod does not have.

With failurePolicy: Ignore, a webhook or certificate problem also leaves the pod un-mutated silently, so confirm that the operator pod is Running and that the MutatingWebhookConfiguration exists.

CacheBlendEngine Fields#

CacheBlendEngineSpec mirrors LMCacheEngineSpec (every field in the CRD Spec Reference above) and adds:

| Field | Default | Description |
|---|---|---|
| blend.checkLayer | 1 | Layer at which token importance is scored (cb.check_layer). |
| blend.recompRatio | 0.15 | Fraction of non-prefix-hit tokens recomputed (cb.recomp_ratio). |
| injection.payloadImage | required | The (private) cacheblend-plugin init-container image (repository / tag / pullPolicy). Set repository – the inherited engine-image default is not a valid payload. |
| injection.imagePullSecrets | | Pull secrets appended to the vLLM pod for the private payload image. |
| injection.targetContainer | first container | Name of the vLLM container to inject into. |
| injection.cudagraph | eager | eager, piecewise, or full_decode_only (never full). |

server.chunkSize defaults to 256 and must equal 256 (the blend matcher requires chunk_size == vLLM --block-size * 4).
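The constraint is simple arithmetic on the --block-size 64 flag that the webhook injects. A small Python sketch, with hypothetical helper names, makes the relationship explicit:

```python
VLLM_BLOCK_SIZE = 64  # the --block-size value injected by the webhook


def required_chunk_size(block_size: int = VLLM_BLOCK_SIZE) -> int:
    """Chunk size the blend matcher expects: vLLM block size times four."""
    return block_size * 4


def chunk_size_ok(chunk_size: int, block_size: int = VLLM_BLOCK_SIZE) -> bool:
    """True when server.chunkSize satisfies chunk_size == block_size * 4."""
    return chunk_size == required_chunk_size(block_size)


print(required_chunk_size())  # 256
print(chunk_size_ok(128))     # False
```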

LMCacheCoordinator#

The LMCacheCoordinator CRD runs the mp coordinator – a fleet-wide HTTP service that tracks mp server instances, evicts those whose heartbeats lapse, performs L2 quota eviction, and hosts the global CacheBlend fingerprint directory. It is a plain (non-GPU) Deployment exposed through a ClusterIP Service; engines reach it via coordinator.ref or coordinator.url.

Deploying a Coordinator#

A ready-to-edit manifest lives at config/samples/lmcache_v1alpha1_lmcachecoordinator.yaml in the operator repo. A minimal coordinator:

apiVersion: lmcache.lmcache.ai/v1alpha1
kind: LMCacheCoordinator
metadata:
  name: my-coordinator
spec:
  port: 9300

kubectl get lmcc my-coordinator   # shortName: lmcc

Connecting an Engine#

Point an LMCacheEngine / CacheBlendEngine at the coordinator through its coordinator block. Use ref to name a coordinator in the same namespace (the operator resolves it to the in-cluster Service URL), or url for an explicit endpoint:

spec:
  coordinator:
    ref:
      name: my-coordinator       # or: url: http://my-coordinator.default.svc:9300
    heartbeatInterval: 5          # seconds; must be > 0
    l2EventReporting: false       # report L2 store/lookup events for fleet eviction

Coordinator CRD Spec Reference#

Topology#

| Field | Default | Description |
|---|---|---|
| replicas | 1 | Coordinator pods. The registry is per-process in-memory, so >1 only makes sense behind a shared durable backend. Must be >= 0. |
| image.repository / image.tag / image.pullPolicy | shared engine image | Runs the same lmcache binary as the engines. |
| imagePullSecrets | | Image pull secret references. |

HTTP Server#

| Field | Default | Description |
|---|---|---|
| host | 0.0.0.0 | Address the coordinator’s HTTP server binds to. |
| port | 9300 | HTTP port (1–65535). |

Membership & Health#

| Field | Default | Description |
|---|---|---|
| instanceTimeout | 30 | Seconds without a heartbeat after which an instance is evicted. Set comfortably above the engines’ coordinator.heartbeatInterval. |
| healthCheckInterval | 10 | Seconds between health-check sweeps; 0 disables the loop. |
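As a rough guide to what "comfortably above" means, the ratio of instanceTimeout to the engines' heartbeatInterval tells you how many heartbeats fit into one timeout window. The sketch below is plain arithmetic, not coordinator internals; the helper name is hypothetical, and the heartbeat value of 5 seconds is the one used in the Connecting an Engine example.

```python
def heartbeats_per_timeout(instance_timeout: float = 30, heartbeat_interval: float = 5) -> float:
    """How many heartbeat intervals fit into one instanceTimeout window."""
    if instance_timeout <= 0 or heartbeat_interval <= 0:
        raise ValueError("instance_timeout and heartbeat_interval must be > 0")
    return instance_timeout / heartbeat_interval


print(heartbeats_per_timeout())  # 6.0
```

A ratio close to 1 means a single delayed heartbeat can evict a healthy instance, so raise instanceTimeout or lower heartbeatInterval if the ratio is small.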

L2 Quota Eviction#

| Field | Default | Description |
|---|---|---|
| evictionCheckInterval | 5 | Seconds between L2 eviction sweeps; 0 disables the loop. |
| evictionRatio | 0.2 | Fraction of tracked keys (by count) to evict per cycle, [0.0, 1.0]. |
| triggerWatermark | 1.0 | Usage fraction of the quota that fires eviction, (0.0, 1.0]. |

Global CacheBlend Directory#

| Field | Default | Description |
|---|---|---|
| blendChunkSize | 256 | Tokens per chunk for the global CacheBlend directory (the match unit). Must equal the LMCache chunk size the blend servers use. Must be > 0. |
| blendProbeStride | 1 | Positions between match probes. 1 probes every offset for full recall; raise it to trade recall for coordinator CPU. Must be > 0. |

Prometheus, Scheduling & Overrides#

| Field | Default | Description |
|---|---|---|
| prometheus.enabled | true | Expose the metrics container port. See the note below. |
| prometheus.port | 9090 | Metrics port. |
| prometheus.serviceMonitor.enabled | false | Create a ServiceMonitor CR (and headless metrics Service). |
| prometheus.serviceMonitor.interval | 30s | Scrape interval. |
| logLevel | INFO | DEBUG, INFO, WARNING, or ERROR. |
| resourceOverrides | | Pod resource requests/limits (no auto-compute; the coordinator is CPU/memory light). |
| nodeSelector / affinity / tolerations / priorityClassName | | Pod scheduling controls. |
| env / volumes / volumeMounts / podAnnotations / podLabels / serviceAccountName | | Standard pod-shaping fields. |
| extraArgs | | Extra CLI flags (appended last, can override any auto-generated flag). |

Note

The coordinator process does not yet expose a /metrics endpoint. The Prometheus wiring is present for parity but is only useful once metrics are added; serviceMonitor.enabled defaults to false.

Coordinator Resources Created#

For an LMCacheCoordinator named my-coordinator:

| Resource | Name | Purpose |
|---|---|---|
| Deployment | my-coordinator | Runs the coordinator HTTP server pods. |
| Service (ClusterIP) | my-coordinator | Fleet-wide discovery on the HTTP port. |
| Service (headless) | my-coordinator-metrics | Prometheus scrape target (when serviceMonitor.enabled). |
| ServiceMonitor | my-coordinator | Prometheus Operator integration (when serviceMonitor.enabled). |

The status endpoint other components use to reach the coordinator is http://<name>.<namespace>.svc:<port> (e.g. http://my-coordinator.default.svc:9300).
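If you need to set coordinator.url by hand (for example, when templating manifests), the pattern can be filled in programmatically. A minimal sketch; the helper name `coordinator_url` is hypothetical and the defaults are taken from the example above.

```python
def coordinator_url(name: str, namespace: str = "default", port: int = 9300) -> str:
    """Fill in the documented pattern http://<name>.<namespace>.svc:<port>."""
    return f"http://{name}.{namespace}.svc:{port}"


print(coordinator_url("my-coordinator"))
# http://my-coordinator.default.svc:9300
```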

Coordinator Status & Conditions#

The status section includes:

  • phase: Pending, Running, Degraded, or Failed.

  • replicas / readyReplicas: Pod counts from the Deployment.

  • endpoint: In-cluster URL for reaching the coordinator.

  • observedGeneration: Most recent reconciled generation.

  • conditions:

    • Available – At least one replica is ready.

    • AllInstancesReady – All desired replicas are ready.

    • ConfigValid – Spec validation passed.

Coordinator Validation Rules#

| Field | Rule |
|---|---|
| port | Must be in [1, 65535]. |
| replicas | Must be >= 0. |
| instanceTimeout | Must be > 0. |
| healthCheckInterval / evictionCheckInterval | Must be >= 0. |
| evictionRatio | Must be in [0.0, 1.0]. |
| triggerWatermark | Must be in (0.0, 1.0]. |
| blendChunkSize / blendProbeStride | Must be > 0. |
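These rules are easy to express as a lookup table, which is handy for linting coordinator manifests before they are applied. The sketch below mirrors the rules listed above on a spec loaded as a Python dict; it is not the operator's validation code, and the names `COORDINATOR_RULES` and `validate_coordinator_spec` are hypothetical.

```python
# Each entry maps a spec field to (predicate, message), following the rules above.
COORDINATOR_RULES = {
    "port": (lambda v: 1 <= v <= 65535, "must be in [1, 65535]"),
    "replicas": (lambda v: v >= 0, "must be >= 0"),
    "instanceTimeout": (lambda v: v > 0, "must be > 0"),
    "healthCheckInterval": (lambda v: v >= 0, "must be >= 0"),
    "evictionCheckInterval": (lambda v: v >= 0, "must be >= 0"),
    "evictionRatio": (lambda v: 0.0 <= v <= 1.0, "must be in [0.0, 1.0]"),
    "triggerWatermark": (lambda v: 0.0 < v <= 1.0, "must be in (0.0, 1.0]"),
    "blendChunkSize": (lambda v: v > 0, "must be > 0"),
    "blendProbeStride": (lambda v: v > 0, "must be > 0"),
}


def validate_coordinator_spec(spec: dict) -> list[str]:
    """Return rule violations for the fields present in the spec (illustrative only)."""
    errors = []
    for field, (is_valid, message) in COORDINATOR_RULES.items():
        # Unset fields fall back to their defaults, so only check what is present.
        if field in spec and not is_valid(spec[field]):
            errors.append(f"{field} {message}")
    return errors


print(validate_coordinator_spec({"port": 9300, "triggerWatermark": 0.0}))
# ['triggerWatermark must be in (0.0, 1.0]']
```

Note the asymmetry the table encodes: evictionRatio may be 0.0, while triggerWatermark may not.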

Operator vs Manual Deployment#

| Concern | Manual DaemonSet | LMCacheEngine Operator |
|---|---|---|
| hostIPC | Must set manually | Auto-injected |
| --host 0.0.0.0 | Must set manually | Auto-injected |
| Service discovery | hostNetwork + status.hostIP | Node-local ClusterIP Service + ConfigMap |
| vLLM config | Copy JSON into Deployment | Mount <name>-connection ConfigMap |
| Resource sizing | Manual calculation | Auto-computed from l1.sizeGB |
| Prometheus | Manual ServiceMonitor | serviceMonitor.enabled: true |
| Validation | Runtime errors only | kubectl apply rejects invalid specs |
| New GPU nodes | DaemonSet handles it | DaemonSet handles it (same) |

Security Considerations#

hostIPC exposes the host’s IPC namespace (System V IPC, POSIX message queues) to the container. Any process in the container can interact with IPC resources from other processes on the same host.

  • Deploy only in trusted environments.

  • Clusters using Pod Security Standards must allow the privileged profile for the LMCache namespace – the baseline and restricted profiles reject hostIPC.

Development#

make generate     # Generate DeepCopy methods
make manifests    # Generate CRD YAML + RBAC
make build        # Compile operator binary
make fmt          # go fmt
make vet          # go vet
make test         # Run unit tests
make lint         # Run golangci-lint

Pushing a custom operator image:

# Docker Hub
make docker-build docker-push IMG=docker.io/<your-user>/lmcache-operator:latest
make deploy IMG=docker.io/<your-user>/lmcache-operator:latest

# Multi-platform (amd64 + arm64)
make docker-buildx IMG=<your-registry>/lmcache-operator:latest

If your cluster needs pull credentials:

kubectl create secret docker-registry regcred \
  --docker-server=<your-registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n lmcache-operator-system