# Layerwise KV Transfer
Storing and loading the KV cache at layer granularity is a key optimization: the forward pass can "stagger" through its computation, starting on each layer as soon as that layer's KV cache arrives, instead of waiting for the entire cache to finish loading.

CacheBlend is implemented on top of this layerwise codepath, pipelining recomputation with loading to mask the latency of fetching the KV cache.
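The latency benefit of staggering can be sketched with a toy timing model. All timings here are hypothetical, and `t_load` / `t_compute` are illustrative per-layer costs, not quantities LMCache exposes:

```python
# Toy model of blocking vs. layerwise ("staggered") KV cache loading.
# Blocking: compute starts only after all N layers are loaded:
#   total = N * t_load + N * t_compute
# Layerwise: layer i+1 loads while layer i computes, so:
#   total = t_load + (N - 1) * max(t_load, t_compute) + t_compute

def blocking_total(n_layers, t_load, t_compute):
    # Wait for the whole cache, then run the full forward pass.
    return n_layers * t_load + n_layers * t_compute

def layerwise_total(n_layers, t_load, t_compute):
    loaded_at = t_load          # time layer 0's KV cache is ready
    done_at = 0.0               # time the previous layer's compute finished
    for _ in range(n_layers):
        start = max(loaded_at, done_at)   # need both the cache and a free GPU
        done_at = start + t_compute       # compute layer i
        loaded_at += t_load               # next layer loads in the background
    return done_at

print(blocking_total(32, 2.0, 1.0))   # 96.0
print(layerwise_total(32, 2.0, 1.0))  # 65.0
```

With load-bound timings (2.0 per layer to load, 1.0 to compute), the staggered schedule hides almost all compute time behind loading.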
## Architecture Overview
- **CacheEngine**: The main orchestrator, containing two primary generators:
  - Retrieval Generator (N + 2 yields): handles layer-by-layer KV cache loading with on-demand memory allocation
  - Storage Generator (N + 1 yields): manages layer-by-layer KV cache saving with upfront CPU memory allocation
- **LayerwiseGPUConnector**: Manages GPU-CPU memory transfers with dedicated CUDA streams:
  - Load GPU buffer: temporary GPU memory for CPU→GPU transfers (`use_gpu: true`)
  - Store GPU buffer: temporary GPU memory for GPU→CPU transfers (`use_gpu: true`)
  - Nested generators: `batched_to_gpu()` and `batched_from_gpu()` handle the actual memory operations
- **StorageManager**: Handles persistent storage operations:
  - `layerwise_batched_get()`: asynchronous retrieval, with `.result()` used for request-level concurrency
  - `batched_put()`: stores memory objects to persistent backends
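The "N + 2 yields" shape of the Retrieval Generator can be sketched as a plain Python generator. This is a simplified, hypothetical illustration of the pattern, not LMCache's actual code; `fetch_layer` stands in for the real per-layer fetch-and-copy work:

```python
# Hypothetical sketch of the "N + 2 yields" retrieval generator pattern:
# one setup yield, one yield per loaded layer, and one final yield.
def retrieve_layer(num_layers, fetch_layer):
    buffers = []                        # allocated on demand, layer by layer
    yield "setup"                       # 1st next(): setup only, nothing loaded
    for i in range(num_layers):
        buffers.append(fetch_layer(i))  # e.g. fetch + CPU->GPU copy of layer i
        yield f"layer_{i}"              # layer i's KV cache is now usable
    yield "done"                        # all layers transferred; cleanup done

gen = retrieve_layer(2, lambda i: f"kv[{i}]")
print(next(gen))   # setup
print(next(gen))   # layer_0 -- the forward pass for layer 0 can now begin
print(next(gen))   # layer_1
print(next(gen))   # done
```

Each `next()` call is a natural synchronization point, which is what lets the caller interleave loading with the forward pass one layer at a time.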
## Execution Flow
The layerwise pipeline follows a numbered execution sequence:
1. `start_load_kv()`
   - Initializes the Retrieval Generator via `lmcache_engine.retrieve_layer()`
   - Performs setup (1st `next()`) and loads layer 0 (2nd `next()`)
   - Creates the `layerwise_retrievers` list for ongoing layer processing
2. `wait_for_layer_load()` (repeated for each layer)
   - Advances the Retrieval Generator via `next()` to process layer i
   - Triggers `StorageManager.layerwise_batched_get()` for asynchronous cache retrieval
   - Calls the GPU Load Generator's `batched_to_gpu()` to transfer memory objects to the GPU
   - Last request in the batch: synchronizes via `current_stream.wait_stream(load_stream)`
3. `save_kv_layer()` (repeated for each layer)
   - First call only: creates the Storage Generator with upfront CPU memory allocation
   - Advances the Storage Generator via `next()` to process layer i
   - Calls the GPU Store Generator's `batched_from_gpu()` to transfer GPU data to CPU
   - First request in the batch: synchronizes via `store_stream.wait_stream(current_stream)`
4. `wait_for_save()`
   - Finalizes the Storage Generator with a final `next()` call
   - Completes all `StorageManager.batched_put()` operations
   - Performs GPU Store Generator cleanup
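The four hooks above interleave with the forward pass roughly as a per-layer driver loop. The sketch below only records call order; the function bodies are illustrative stand-ins, not the real implementations:

```python
# Hypothetical driver loop showing how the four layerwise hooks interleave
# with per-layer forward computation (call-order sketch only).
events = []

def start_load_kv():          events.append("start_load_kv")
def wait_for_layer_load(i):   events.append(f"wait_load_{i}")
def forward_layer(i):         events.append(f"forward_{i}")
def save_kv_layer(i):         events.append(f"save_{i}")
def wait_for_save():          events.append("wait_for_save")

def run_forward_pass(num_layers):
    start_load_kv()                 # setup + kick off layer 0's load
    for i in range(num_layers):
        wait_for_layer_load(i)      # block until layer i's KV cache arrives
        forward_layer(i)            # compute layer i (layer i+1 loads async)
        save_kv_layer(i)            # queue layer i's KV cache for storage
    wait_for_save()                 # drain all pending batched_put() work

run_forward_pass(2)
print(events)
```

The key property is that loading for layer i+1 and storage for layer i both proceed in the background while layer i computes.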
## Key Optimizations
- **Pipelined Memory Operations**: The system overlaps layer N+1's computation with layer N's storage.
- **Stream Synchronization**: Three CUDA streams coordinate operations:
  - `current_stream`: vLLM's forward-pass computation
  - `load_stream`: KV cache loading operations
  - `store_stream`: KV cache storing operations
- **Batch-Level Coordination**: Multiple requests are processed together with specialized synchronization:
  - First request: provides store-stream synchronization to prevent GPU buffer corruption
  - Last request: provides load-stream synchronization to ensure KV cache availability
- **Memory Allocation Strategies**:
  - Retrieval: layer-by-layer allocation
  - Storage: upfront allocation for all layers
- **Cache Key Management**: Multi-layer cache engine keys use `split_layers(N)` to create per-layer keys.
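The `wait_stream()` dependencies described in this section can be modeled without a GPU by tracking logical completion times. `Stream` below is a toy stand-in for `torch.cuda.Stream`, not the real API; the point is only the ordering guarantee that later work on a waiting stream cannot start before the awaited stream's queued work finishes:

```python
# Toy model of CUDA stream dependencies using logical completion times.
class Stream:
    def __init__(self):
        self.t = 0.0                  # completion time of last enqueued op
    def enqueue(self, duration):
        self.t += duration            # work on a stream runs in FIFO order
    def wait_stream(self, other):
        # Later work on self cannot start before other's queued work ends.
        self.t = max(self.t, other.t)

current, load, store = Stream(), Stream(), Stream()

load.enqueue(2.0)            # layer 0 KV cache loading (hypothetical cost)
current.wait_stream(load)    # last request: compute waits for the load
current.enqueue(1.0)         # layer 0 forward pass
store.wait_stream(current)   # first request: store waits for compute
store.enqueue(0.5)           # layer 0 KV cache offload to CPU

print(current.t)  # 3.0
print(store.t)    # 3.5
```

This mirrors the two synchronization points in the execution flow: compute cannot read KV cache that has not arrived, and storage cannot copy KV cache that has not been produced.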
## Configuration
Enable layerwise caching by setting `use_layerwise: true`.
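A configuration sketch might look like the following. Only `use_layerwise` comes from this section; the other keys are common LMCache settings shown for context, and their names and defaults should be checked against your version's configuration reference:

```yaml
# Hypothetical example LMCache config; only use_layerwise is required
# for the feature described here.
chunk_size: 256          # tokens per KV cache chunk
local_cpu: true          # enable the local CPU offloading backend
max_local_cpu_size: 5.0  # CPU cache budget (GB)
use_layerwise: true      # enable layerwise KV transfer
```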
The system automatically selects the appropriate layerwise GPU connector based on configuration:

- `VLLMPagedMemLayerwiseGPUConnector`: for standard layerwise operations
- `VLLMBufferLayerwiseGPUConnector`: when blending is enabled
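The selection rule above can be summarized as a small sketch; the real decision is made inside LMCache's connector construction code and takes more inputs than shown here:

```python
# Hypothetical sketch of the connector selection rule described above.
def select_layerwise_connector(enable_blending: bool) -> str:
    if enable_blending:
        # CacheBlend needs the buffer-based connector.
        return "VLLMBufferLayerwiseGPUConnector"
    # Standard layerwise path works directly against vLLM's paged memory.
    return "VLLMPagedMemLayerwiseGPUConnector"

print(select_layerwise_connector(False))  # VLLMPagedMemLayerwiseGPUConnector
```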