Multiprocess Mode#

LMCache multiprocess (MP) mode runs LMCache as a standalone service that vLLM instances connect to over ZMQ. One LMCache server per node can serve multiple vLLM pods, providing process isolation, shared caching, and independent resource scaling.
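A minimal deployment sketch of this topology follows. The model name and the kv-transfer settings are illustrative assumptions, not verified defaults — consult the LMCache integration docs for the exact connector configuration:

```shell
# Start one LMCache MP server per node (entry point from this page).
# Listen-address/port flags are omitted here; defaults are assumed.
python3 -m lmcache.v1.multiprocess.server &

# Each vLLM pod on the node then connects to the server over ZMQ.
# The connector name and JSON below are illustrative; check the
# LMCache docs for the exact kv-transfer configuration for MP mode.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```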

Key Benefits#

  • Process isolation – LMCache and vLLM run in separate processes (or containers), so a cache-related issue does not crash the inference engine.

  • No GIL contention or Python overhead on the inference path – By running LMCache in a separate process, its Python GIL and CPU work (hashing, memory management, L2 I/O) do not compete with vLLM’s inference threads.

  • Shared caching across pods – Multiple vLLM instances on the same node share a single L1 cache, maximizing KV reuse.

  • Independent resource scaling – Allocate CPU memory for caching independently of GPU memory for inference.

  • Multi-tier storage (L1 + L2) – In-memory L1 cache backed by persistent L2 storage via NIXL (GDS, POSIX, HF3FS, and more).

  • Built-in observability – Prometheus metrics and a telemetry event system out of the box.
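The CPU-memory and multi-tier points above are driven by the LMCache configuration file. A sketch of the L1 side, assuming standard LMCache option names; the L2/NIXL backend has its own settings, which are not shown here to avoid guessing at key names:

```yaml
# Illustrative LMCache config (L1 only).
# chunk_size / local_cpu / max_local_cpu_size are standard LMCache
# options; L2 (NIXL) backend keys are configured separately.
chunk_size: 256
local_cpu: true          # L1: in-memory CPU cache
max_local_cpu_size: 40   # GiB of CPU memory reserved for L1
```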

Prerequisites#

  • vLLM – the latest release is recommended for best compatibility

  • LMCache – the latest dev branch
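A typical way to satisfy these prerequisites (installing vLLM from PyPI and LMCache from source) — the exact branch and any extra build dependencies may vary:

```shell
# vLLM: latest release from PyPI
pip install -U vllm

# LMCache: latest dev branch, installed from source
git clone https://github.com/LMCache/LMCache.git
cd LMCache && pip install -e .
```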

Server Variants#

LMCache ships three server entry points:

  • python3 -m lmcache.v1.multiprocess.server – ZMQ-only server (default, production).

  • python3 -m lmcache.v1.multiprocess.blend_server – CacheBlend-enabled server for non-prefix KV reuse.

  • python3 -m lmcache.v1.multiprocess.http_server – ZMQ + FastAPI HTTP frontend (adds /api/healthcheck for K8s probes).
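Since the HTTP variant exposes /api/healthcheck, it can back a standard Kubernetes probe. A sketch for the pod spec — the port and timing values are illustrative assumptions:

```yaml
# Readiness probe against the http_server's /api/healthcheck endpoint.
# Port 8000 and the timings below are assumptions; match them to your
# http_server deployment.
readinessProbe:
  httpGet:
    path: /api/healthcheck
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
```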