# Layerwise KV Transfer
Storing and loading the KV cache at layer granularity is a key optimization: the forward pass can "stagger" through its computation, starting on each layer as soon as that layer's KV cache arrives, instead of waiting for the entire cache to finish loading.

CacheBlend is implemented on top of this layerwise codepath, pipelining recomputation with loading to mask the latency of fetching the KV cache.
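The latency benefit of staggering can be sketched with a toy timing model. All timings here are hypothetical, and `t_load` / `t_compute` are illustrative per-layer costs, not quantities LMCache exposes:

```python
# Toy model of blocking vs. layerwise ("staggered") KV cache loading.
# Blocking: compute starts only after all N layers are loaded:
#   total = N * t_load + N * t_compute
# Layerwise: layer i+1 loads while layer i computes, so:
#   total = t_load + (N - 1) * max(t_load, t_compute) + t_compute

def blocking_total(n_layers, t_load, t_compute):
    # Wait for the whole cache, then run the full forward pass.
    return n_layers * t_load + n_layers * t_compute

def layerwise_total(n_layers, t_load, t_compute):
    loaded_at = t_load          # time layer 0's KV cache is ready
    done_at = 0.0               # time the previous layer's compute finished
    for _ in range(n_layers):
        start = max(loaded_at, done_at)   # need both the cache and a free GPU
        done_at = start + t_compute       # compute layer i
        loaded_at += t_load               # next layer loads in the background
    return done_at

print(blocking_total(32, 2.0, 1.0))   # 96.0
print(layerwise_total(32, 2.0, 1.0))  # 65.0
```

With load-bound timings (2.0 per layer to load, 1.0 to compute), the staggered schedule hides almost all compute time behind loading.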
## Architecture Overview
- **CacheEngine**: The main orchestrator, containing two primary generators:
  - Retrieval Generator (N + 2 yields): handles layer-by-layer KV cache loading with on-demand memory allocation
  - Storage Generator (N + 1 yields): manages layer-by-layer KV cache saving with upfront CPU memory allocation
- **LayerwiseGPUConnector**: Manages GPU-CPU memory transfers with dedicated CUDA streams:
  - Load GPU buffer: temporary GPU memory for CPU→GPU transfers (`use_gpu: true`)
  - Store GPU buffer: temporary GPU memory for GPU→CPU transfers (`use_gpu: true`)
  - Nested generators: `batched_to_gpu()` and `batched_from_gpu()` handle the actual memory operations
- **StorageManager**: Handles persistent storage operations:
  - `layerwise_batched_get()`: asynchronous retrieval, with `.result()` used for request-level concurrency
  - `batched_put()`: stores memory objects to persistent backends
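The "N + 2 yields" shape of the Retrieval Generator can be sketched as a plain Python generator. This is a simplified, hypothetical illustration of the pattern, not LMCache's actual code; `fetch_layer` stands in for the real per-layer fetch-and-copy work:

```python
# Hypothetical sketch of the "N + 2 yields" retrieval generator pattern:
# one setup yield, one yield per loaded layer, and one final yield.
def retrieve_layer(num_layers, fetch_layer):
    buffers = []                        # allocated on demand, layer by layer
    yield "setup"                       # 1st next(): setup only, nothing loaded
    for i in range(num_layers):
        buffers.append(fetch_layer(i))  # e.g. fetch + CPU->GPU copy of layer i
        yield f"layer_{i}"              # layer i's KV cache is now usable
    yield "done"                        # all layers transferred; cleanup done

gen = retrieve_layer(2, lambda i: f"kv[{i}]")
print(next(gen))   # setup
print(next(gen))   # layer_0 -- the forward pass for layer 0 can now begin
print(next(gen))   # layer_1
print(next(gen))   # done
```

Each `next()` call is a natural synchronization point, which is what lets the caller interleave loading with the forward pass one layer at a time.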
## Execution Flow
The layerwise pipeline follows a numbered execution sequence:
1. `start_load_kv()`
   - Initializes the Retrieval Generator via `lmcache_engine.retrieve_layer()`
   - Performs setup (1st `next()`) and loads layer 0 (2nd `next()`)
   - Creates the `layerwise_retrievers` list for ongoing layer processing
2. `wait_for_layer_load()` (repeated for each layer)
   - Advances the Retrieval Generator via `next()` to process layer i
   - Triggers `StorageManager.layerwise_batched_get()` for asynchronous cache retrieval
   - Calls the GPU Load Generator's `batched_to_gpu()` to transfer memory objects to the GPU
   - Last request in the batch: synchronizes via `current_stream.wait_stream(load_stream)`
3. `save_kv_layer()` (repeated for each layer)
   - First call only: creates the Storage Generator with upfront CPU memory allocation
   - Advances the Storage Generator via `next()` to process layer i
   - Calls the GPU Store Generator's `batched_from_gpu()` to transfer GPU data to CPU
   - First request in the batch: synchronizes via `store_stream.wait_stream(current_stream)`
4. `wait_for_save()`
   - Finalizes the Storage Generator with a final `next()` call
   - Completes all `StorageManager.batched_put()` operations
   - Performs GPU Store Generator cleanup
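The four hooks above interleave with the forward pass roughly as a per-layer driver loop. The sketch below only records call order; the function bodies are illustrative stand-ins, not the real implementations:

```python
# Hypothetical driver loop showing how the four layerwise hooks interleave
# with per-layer forward computation (call-order sketch only).
events = []

def start_load_kv():          events.append("start_load_kv")
def wait_for_layer_load(i):   events.append(f"wait_load_{i}")
def forward_layer(i):         events.append(f"forward_{i}")
def save_kv_layer(i):         events.append(f"save_{i}")
def wait_for_save():          events.append("wait_for_save")

def run_forward_pass(num_layers):
    start_load_kv()                 # setup + kick off layer 0's load
    for i in range(num_layers):
        wait_for_layer_load(i)      # block until layer i's KV cache arrives
        forward_layer(i)            # compute layer i (layer i+1 loads async)
        save_kv_layer(i)            # queue layer i's KV cache for storage
    wait_for_save()                 # drain all pending batched_put() work

run_forward_pass(2)
print(events)
```

The key property is that loading for layer i+1 and storage for layer i both proceed in the background while layer i computes.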
## Key Optimizations
- **Pipelined Memory Operations**: The system overlaps layer N+1's computation with layer N's storage.
- **Stream Synchronization**: Three CUDA streams coordinate operations:
  - `current_stream`: vLLM's forward-pass computation
  - `load_stream`: KV cache loading operations
  - `store_stream`: KV cache storing operations
- **Batch-Level Coordination**: Multiple requests are processed together with specialized synchronization:
  - First request: provides store-stream synchronization to prevent GPU buffer corruption
  - Last request: provides load-stream synchronization to ensure KV cache availability
- **Memory Allocation Strategies**:
  - Retrieval: layer-by-layer allocation
  - Storage: upfront allocation for all layers
- **Cache Key Management**: Multi-layer cache engine keys use `split_layers(N)` to create per-layer keys.
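The `wait_stream()` dependencies described in this section can be modeled without a GPU by tracking logical completion times. `Stream` below is a toy stand-in for `torch.cuda.Stream`, not the real API; the point is only the ordering guarantee that later work on a waiting stream cannot start before the awaited stream's queued work finishes:

```python
# Toy model of CUDA stream dependencies using logical completion times.
class Stream:
    def __init__(self):
        self.t = 0.0                  # completion time of last enqueued op
    def enqueue(self, duration):
        self.t += duration            # work on a stream runs in FIFO order
    def wait_stream(self, other):
        # Later work on self cannot start before other's queued work ends.
        self.t = max(self.t, other.t)

current, load, store = Stream(), Stream(), Stream()

load.enqueue(2.0)            # layer 0 KV cache loading (hypothetical cost)
current.wait_stream(load)    # last request: compute waits for the load
current.enqueue(1.0)         # layer 0 forward pass
store.wait_stream(current)   # first request: store waits for compute
store.enqueue(0.5)           # layer 0 KV cache offload to CPU

print(current.t)  # 3.0
print(store.t)    # 3.5
```

This mirrors the two synchronization points in the execution flow: compute cannot read KV cache that has not arrived, and storage cannot copy KV cache that has not been produced.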
## Configuration
Enable layerwise caching by setting `use_layerwise: true`.
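A configuration sketch might look like the following. Only `use_layerwise` comes from this section; the other keys are common LMCache settings shown for context, and their names and defaults should be checked against your version's configuration reference:

```yaml
# Hypothetical example LMCache config; only use_layerwise is required
# for the feature described here.
chunk_size: 256          # tokens per KV cache chunk
local_cpu: true          # enable the local CPU offloading backend
max_local_cpu_size: 5.0  # CPU cache budget (GB)
use_layerwise: true      # enable layerwise KV transfer
```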
The system automatically selects the appropriate layerwise GPU connector based on configuration:

- `VLLMPagedMemLayerwiseGPUConnector`: for standard layerwise operations
- `VLLMBufferLayerwiseGPUConnector`: when blending is enabled
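The selection rule above can be summarized as a small sketch; the real decision is made inside LMCache's connector construction code and takes more inputs than shown here:

```python
# Hypothetical sketch of the connector selection rule described above.
def select_layerwise_connector(enable_blending: bool) -> str:
    if enable_blending:
        # CacheBlend needs the buffer-based connector.
        return "VLLMBufferLayerwiseGPUConnector"
    # Standard layerwise path works directly against vLLM's paged memory.
    return "VLLMPagedMemLayerwiseGPUConnector"

print(select_layerwise_connector(False))  # VLLMPagedMemLayerwiseGPUConnector
```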