Architecture Overview#
High-Level System Architecture#
LMCache extends an LLM inference engine (e.g., vLLM) with a multi-tier KV cache storage system spanning GPU memory, CPU memory, and disk/remote backends. The diagram below illustrates how KV cache blocks move across these layers.
Multi-Tier Storage Architecture
LMCache implements a hierarchical storage system with four distinct tiers (an illustrative configuration sketch follows this list):
GPU Memory: Holds the active working set of KV caches that are currently being used by the model
CPU DRAM: Acts as a “hot cache” for recently used KV chunks, using pinned memory for efficient GPU-CPU transfers
Local storage (e.g., local disk, NVMe GDS): Provides a large-capacity tier for local KV caching (e.g., long documents)
Remote storage (e.g., Redis, Mooncake, InfiniStore): Persistent storage for KV caches; reliable, but slower than the tiers above
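These tiers are typically selected through configuration rather than code. The sketch below is illustrative only: the key names follow the LMCache configuration reference but may differ across versions, and the paths, sizes, and Redis endpoint are placeholders.

```python
# Illustrative only: key names follow the LMCache configuration reference but
# may differ across versions; paths, sizes, and the Redis URL are placeholders.
import os
import pathlib

config_yaml = """\
chunk_size: 256                       # tokens per KV cache chunk
local_cpu: true                       # enable the CPU DRAM ("hot cache") tier
max_local_cpu_size: 5.0               # CPU tier budget in GB
local_disk: "/tmp/lmcache_disk/"      # local disk tier (path format may differ)
max_local_disk_size: 10.0             # disk tier budget in GB
remote_url: "redis://localhost:6379"  # remote tier (placeholder endpoint)
remote_serde: "naive"                 # serialization used for remote writes
"""

config_path = pathlib.Path("/tmp/lmcache_config.yaml")
config_path.write_text(config_yaml)

# LMCache reads its configuration from the file this variable points to.
os.environ["LMCACHE_CONFIG_FILE"] = str(config_path)
```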
Data Flow and Operations
When the model generates new key-value (KV) cache chunks on the GPU, LMCache can:
Offload overflow KV caches from GPU to CPU DRAM, freeing precious GPU memory
Asynchronously write KV caches from CPU to disk or remote storage using LRU eviction policies
Prefetch hot KV caches from disk/remote storage back to CPU when needed
Reuse cached segments on demand by moving them from CPU back to GPU
This architecture enables LMCache to significantly reduce prefill delays and GPU memory pressure while maintaining high performance through intelligent cache management.
```mermaid
flowchart TB
    subgraph "LLM Engine (with LMCache Integration)"
        direction TB
        GPU["GPU Memory"]
        CPU["CPU DRAM"]
        GPU -- "Offload overflow KV" --> CPU
        CPU -- "On-demand reuse" --> GPU
    end
    Disk[(Disk Storage Backend)]
    Remote[(Remote Storage Backend)]
    CPU -- "Async write (LRU evict)" --> Disk
    CPU -- "Async upload" --> Remote
    Disk -- "Prefetch hot KV" --> CPU
    Remote -- "Fetch on reuse" --> CPU
```
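The flow in the diagram can be read as a fastest-first lookup plus write-back on eviction. The following sketch is purely conceptual and does not use LMCache's real classes; it shows the retrieve path walking the tiers from fastest to slowest and promoting hits back toward the GPU, with synchronous dictionary writes standing in for the asynchronous offload described above.

```python
# Conceptual sketch of the tiered retrieve/offload flow -- not LMCache's classes.
from collections import OrderedDict
from typing import Optional

class TieredKVCache:
    def __init__(self, cpu_capacity: int = 2):
        self.cpu = OrderedDict()   # hot tier in LRU order (oldest entry first)
        self.disk = {}             # stands in for the local disk backend
        self.remote = {}           # stands in for Redis/Mooncake/InfiniStore
        self.cpu_capacity = cpu_capacity

    def offload(self, chunk_hash: str, kv_chunk: bytes) -> None:
        """GPU -> CPU offload; evict the LRU chunk to slower tiers when full."""
        self.cpu[chunk_hash] = kv_chunk
        self.cpu.move_to_end(chunk_hash)
        if len(self.cpu) > self.cpu_capacity:
            evicted_hash, evicted_chunk = self.cpu.popitem(last=False)
            self.disk[evicted_hash] = evicted_chunk      # asynchronous in practice
            self.remote[evicted_hash] = evicted_chunk    # asynchronous in practice

    def retrieve(self, chunk_hash: str) -> Optional[bytes]:
        """Walk the tiers fastest-first; promote hits back into CPU DRAM."""
        for tier in (self.cpu, self.disk, self.remote):
            if chunk_hash in tier:
                kv_chunk = tier[chunk_hash]
                if tier is not self.cpu:
                    self.offload(chunk_hash, kv_chunk)   # prefetch / promote
                return kv_chunk
        return None  # cache miss: the engine must recompute this chunk

cache = TieredKVCache()
for i in range(4):
    cache.offload(f"chunk-{i}", b"...")
print(cache.retrieve("chunk-0") is not None)  # True: served from disk, promoted
```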
Two Modes#
- Storage Mode (KV cache offloading)
LMCache acts as a persistent KV store, optimizing for high reuse across queries or sessions. It offloads infrequently used KV blocks from GPU memory and persists popular caches across sessions, boosting cache hit rates for “hot” content. KV caches survive beyond single inference calls and even process restarts when backed by disk or external storage.
```mermaid
sequenceDiagram
    participant Main as LLM Inference Thread
    participant DiskTask as Disk Offload Task
    participant RemoteTask as Remote Offload Task
    Main->>Main: New KV chunk created (GPU memory)
    Main->>Main: Copy KV chunk to CPU buffer
    par Disk backend offload
        Main--)DiskTask: Spawn async disk write task
        DiskTask-->>DiskTask: Compress & save chunk to disk
    and Remote backend offload
        Main--)RemoteTask: Spawn async remote upload task
        RemoteTask-->>RemoteTask: Send chunk to remote store
    end
    Main-->>Main: Continue with next inference (no blocking)
```
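A minimal sketch of the non-blocking hand-off shown in the sequence diagram, assuming one worker pool per backend. The helper functions here are hypothetical stand-ins, not LMCache APIs.

```python
# Illustrative non-blocking offload mirroring the sequence diagram above.
# write_to_disk / upload_to_remote are hypothetical stand-ins, not LMCache APIs.
from concurrent.futures import ThreadPoolExecutor
import time
import zlib

disk_pool = ThreadPoolExecutor(max_workers=2)
remote_pool = ThreadPoolExecutor(max_workers=2)

def write_to_disk(chunk_hash: str, cpu_buffer: bytes) -> None:
    compressed = zlib.compress(cpu_buffer)             # compress & save to disk
    with open(f"/tmp/{chunk_hash}.kv", "wb") as f:
        f.write(compressed)

def upload_to_remote(chunk_hash: str, cpu_buffer: bytes) -> None:
    time.sleep(0.01)                                    # stands in for a network PUT

def offload_chunk(chunk_hash: str, cpu_buffer: bytes) -> None:
    """Spawn both offload tasks and return immediately; inference is not blocked."""
    disk_pool.submit(write_to_disk, chunk_hash, cpu_buffer)
    remote_pool.submit(upload_to_remote, chunk_hash, cpu_buffer)

offload_chunk("chunk-abc123", b"\x00" * 1024)           # returns without blocking
```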
- Transport Mode (Prefill-decode disaggregation)
Transport Mode accelerates distributed inference by routing KV cache data between nodes in real time. It enables prefill-decode disaggregation, where one server computes the KV cache for a prompt and delivers it to another server for token generation without recomputation. Transfers run over peer-to-peer channels built on communication libraries such as NIXL for low-latency, high-bandwidth movement.
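As a rough sketch of how this mode is wired up from the vLLM side, the snippet below follows the LMCache + vLLM disaggregated-prefill examples. The class and field names (KVTransferConfig, LMCacheConnectorV1, kv_role) are assumptions about your installed versions and should be checked against the integration docs.

```python
# Rough sketch of prefill-decode disaggregation from the vLLM side.
# Class/field names follow the LMCache + vLLM integration examples and may
# differ across versions; treat them as assumptions, not a verified API.
from vllm import LLM
from vllm.config import KVTransferConfig

# Prefill node: computes KV caches for prompts and produces them for transfer.
prefiller = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_producer",
    ),
)

# Decode node (a separate server in practice): consumes the transferred KV
# caches and generates tokens without re-running prefill.
decoder_kv_config = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_consumer",
)
```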
Core Components#
- LLM Inference Engine Integration Module (Connector)
Integrated into the LLM engine (vLLM), the Connector taps into the paged KV memory manager. During prompt processing, it checks whether the incoming token sequences have been seen before (see the sketch after this list):
Cache hit: Fetches precomputed KV cache chunks from LMCache, bypassing computation
Cache miss: Model computes KV as usual, then Connector hands newly-generated KV data to LMCache for storage
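Conceptually, the hook behaves like a lookup-or-compute wrapper around prefill. The sketch below is illustrative Python pseudocode; lookup, store, and compute_kv are hypothetical names, not the Connector's real interface.

```python
# Conceptual hit/miss logic of the Connector -- illustrative, not the real API.
def prefill_with_cache(token_ids, lmcache, model):
    # Hypothetical call: returns cached KV plus how many prompt tokens it covers.
    cached_kv, num_hit_tokens = lmcache.lookup(token_ids)
    if num_hit_tokens == len(token_ids):
        return cached_kv                   # full hit: skip prefill entirely
    # Partial or no hit: compute KV only for the uncached suffix of the prompt.
    new_kv = model.compute_kv(token_ids[num_hit_tokens:], prefix_kv=cached_kv)
    lmcache.store(token_ids, cached_kv + new_kv)   # hand new KV back to LMCache
    return cached_kv + new_kv              # KV treated as a list of chunk tensors
```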
- Cache Index (Token Database)
Maintains an internal index mapping token sequences to cached KV entries and their locations. Enables cross-request and cross-instance cache lookups with a configurable chunking strategy (default 256 tokens) and hashing scheme.
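For illustration, chunked prefix hashing can be pictured as hashing each 256-token chunk together with the hash of everything before it, so that one key identifies a chunk plus its full prefix. This is a sketch of the idea, not LMCache's exact hashing scheme.

```python
# Illustrative chunked prefix hashing -- a sketch, not LMCache's exact scheme.
import hashlib

CHUNK_SIZE = 256  # default chunk size in tokens

def chunk_keys(token_ids: list[int]) -> list[str]:
    """Return one lookup key per full chunk; each key depends on the whole prefix."""
    keys = []
    prefix_hash = ""
    num_full = len(token_ids) - len(token_ids) % CHUNK_SIZE
    for start in range(0, num_full, CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        payload = (prefix_hash + ",".join(map(str, chunk))).encode()
        prefix_hash = hashlib.sha256(payload).hexdigest()
        keys.append(prefix_hash)
    return keys

keys = chunk_keys(list(range(600)))   # 600 tokens -> two full 256-token chunks
print(len(keys))                      # 2; the trailing 88 tokens are not indexed
```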
- Memory Object & Allocator
Manages KV cache entries as MemoryObj instances using a custom memory allocator within LocalCPUBackend. Provides pinned memory for fast GPU↔CPU transfers and NUMA-aware allocation, and interfaces with eviction policies (LRU by default).
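The reason the CPU tier wants pinned (page-locked) memory is that copies into pinned buffers can be issued asynchronously and overlap with compute. A minimal PyTorch illustration, independent of LMCache's actual MemoryObj and allocator implementation:

```python
# Minimal PyTorch illustration of why the CPU tier uses pinned memory.
# Independent of LMCache's actual MemoryObj / allocator implementation.
import torch

if torch.cuda.is_available():
    # A stand-in KV chunk on the GPU: (K/V, tokens, heads, head_dim).
    kv_on_gpu = torch.randn(2, 256, 8, 128, device="cuda", dtype=torch.float16)

    # Pinned (page-locked) CPU buffer: enables true asynchronous DMA transfers.
    cpu_buffer = torch.empty(kv_on_gpu.shape, dtype=kv_on_gpu.dtype,
                             device="cpu", pin_memory=True)

    # non_blocking=True only overlaps with compute when the target is pinned.
    cpu_buffer.copy_(kv_on_gpu, non_blocking=True)
    torch.cuda.synchronize()   # wait for the copy before reusing the buffer
```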
- Asynchronous Offloading
Offloads and loads KV cache chunks asynchronously so that inference threads are never blocked and GPU cycles are not wasted waiting on storage I/O.
- Remote Connectors
Plugin-based system for remote backends (Redis, Mooncake, NIXL). A generic RemoteBackend wrapper delegates operations to connector implementations and supports dynamic loading of custom backends.
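As a rough sketch of what such a connector contract could look like, the methods below (exists, get, put) are illustrative assumptions rather than the exact RemoteConnector interface; a custom backend would implement a contract of this kind and the RemoteBackend wrapper would call into it.

```python
# Illustrative connector contract -- method names are assumptions, not the
# exact RemoteConnector interface defined in the LMCache source.
from abc import ABC, abstractmethod
from typing import Optional

class RemoteConnectorSketch(ABC):
    """Minimal get/put contract a remote backend plugin could implement."""

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

class InMemoryConnector(RemoteConnectorSketch):
    """Toy backend standing in for Redis/Mooncake/InfiniStore."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def exists(self, key: str) -> bool:
        return key in self._store

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._store[key] = value
```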
LMCache Controller#
The Controller provides a management API for runtime cache operations (an illustrative client sketch follows this list):
Lookup: Query cache entries for given token sequences and report their locations
Clear: Purge KV cache entirely or for specific entries
Compress/Decompress: On-demand compression using CacheGen or decompression to full precision
Move: Migrate caches to specified locations for cache warming or optimization
Pin/Unpin: Mark cache entries as persistent to prevent eviction
Health & Finish Checks: Report worker health and confirm completion of async operations
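For example, a management client might drive these operations over HTTP. The endpoint paths, port, and payload fields below are hypothetical placeholders, not the documented controller API; consult the controller reference for the real request formats.

```python
# Hypothetical controller client -- endpoint paths, port, and payload fields
# are placeholders, not the documented LMCache controller API.
import requests

CONTROLLER_URL = "http://localhost:9000"   # assumed controller address

# Lookup: ask where the KV cache for a token sequence currently lives.
resp = requests.post(f"{CONTROLLER_URL}/lookup", json={"tokens": [1, 2, 3, 4]})
print(resp.json())

# Pin: keep a hot entry (e.g., a shared system prompt) safe from eviction.
requests.post(
    f"{CONTROLLER_URL}/pin",
    json={"instance_id": "worker-0", "tokens": [1, 2, 3, 4]},
)
```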
The Controller coordinates with all LMCache workers in the system, providing centralized management for both single-instance and distributed deployments.