Async Loading#
Overview#
This document explains the principle, benefits, differences from vLLM PR 19330, and limitations of the LMCache `async_loading` feature.
It focuses on LMCache v1 integration with vLLM and the internal storage pipeline.
Key components changed by this feature include:

- LMCache async lookup client/server (ZMQ-based)
- Storage manager orchestrating backends and concurrency
- Cache engine async API entrypoints
- vLLM adapter integration points
Principle and Theory#
At a high level, async_loading decouples scheduler-side lookup from worker-side prefetch/retrieval, allowing overlap between I/O and computation while preserving prefix-based correctness.
The scheduler sends lookup requests with token chunk hashes and offsets.
Worker-side servers perform tiered `batched_async_contains` over the available backends and eagerly launch non-blocking batched get operations for hit prefixes. Completion is tracked via an `EventManager`, which safely delivers loaded memory objects back to the requesting path. A weighted semaphore inside the `AsyncSerializer` prevents allocator deadlocks by shaping concurrency according to the chunk budget.
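As a concrete illustration of this concurrency shaping, here is a minimal asyncio sketch of a weighted semaphore. `WeightedSemaphore`, `retrieve_chunks`, and the budget numbers are illustrative assumptions, not the actual LMCache classes:

```python
import asyncio


class WeightedSemaphore:
    """Counting semaphore where each acquire consumes several units.

    The total budget mirrors the allocator's chunk budget: a request
    loading N chunks holds N units, so in-flight retrievals can never
    exceed what the allocator can back with memory.
    """

    def __init__(self, budget: int) -> None:
        self._budget = budget
        self._available = budget
        self._cond = asyncio.Condition()

    async def acquire(self, weight: int) -> None:
        # Clamp to the total budget so one oversized request cannot
        # deadlock by waiting for more units than exist.
        weight = min(weight, self._budget)
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= weight)
            self._available -= weight

    async def release(self, weight: int) -> None:
        weight = min(weight, self._budget)
        async with self._cond:
            self._available += weight
            self._cond.notify_all()


async def retrieve_chunks(sem: WeightedSemaphore, num_chunks: int) -> None:
    # Hold `num_chunks` units of budget for the duration of the I/O.
    await sem.acquire(num_chunks)
    try:
        await asyncio.sleep(0.01)  # stand-in for a batched backend get
    finally:
        await sem.release(num_chunks)


async def main() -> None:
    sem = WeightedSemaphore(budget=8)
    # Ten requests of 3 chunks each: at most two are in flight at once.
    await asyncio.gather(*(retrieve_chunks(sem, 3) for _ in range(10)))


asyncio.run(main())
```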
The following Mermaid sequence diagram illustrates the end-to-end flow:
```mermaid
sequenceDiagram
    autonumber
    participant S as Scheduler (vLLM)
    participant LC as LMCacheAsyncLookupClient
    participant WS as LMCacheAsyncLookupServer (Worker)
    participant SM as StorageManager
    participant BE as Backends (LocalCPU/LocalDisk/FSConnector)
    participant EM as EventManager
    S->>LC: lookup(token_ids, lookup_id, request_configs)
    note right of LC: Hashes + offsets via TokenDatabase
    LC->>WS: ZMQ PUSH multipart [lookup_id, hashes, offsets, configs]
    WS->>SM: async_lookup_and_prefetch(lookup_id, keys, cum_chunk_lengths)
    SM->>BE: batched_async_contains(lookup_id, keys, pin=True)
    alt prefix hit across tiers
        BE-->>SM: num_hit_chunks (per tier)
        SM->>BE: batched_get_non_blocking(lookup_id, hit_prefix)
        BE-->>SM: Future[List[MemoryObj]]
        SM->>EM: add_event(EventType.LOADING, lookup_id, gather_all)
        SM-->>WS: send_response_to_scheduler(lookup_id, retrieved_length)
        WS-->>LC: ZMQ PUSH [lookup_id, num_hit_tokens]
    else cache miss
        SM-->>WS: send_response_to_scheduler(lookup_id, 0)
        WS-->>LC: ZMQ PUSH [lookup_id, 0]
    end
```
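To make the wire interaction concrete, the following is a minimal sketch of the lookup round trip over ZMQ PUSH/PULL sockets. The endpoint address, msgpack framing, and field values are assumptions for illustration; the real client and server define their own wire format:

```python
import msgpack
import zmq

ctx = zmq.Context.instance()

# Scheduler side: fire-and-forget lookup request over PUSH.
push = ctx.socket(zmq.PUSH)
push.connect("ipc:///tmp/lmcache_lookup")  # hypothetical endpoint
frames = [
    b"req-42",                              # lookup_id
    msgpack.packb([111, 222, 333]),         # token chunk hashes
    msgpack.packb([0, 256, 512]),           # cumulative chunk offsets
    msgpack.packb({"pin": True}),           # request configs
]
push.send_multipart(frames)

# Worker side: the lookup server PULLs and unpacks the same frames.
pull = ctx.socket(zmq.PULL)
pull.bind("ipc:///tmp/lmcache_lookup")
lookup_id, hashes_raw, offsets_raw, cfg_raw = pull.recv_multipart()
hashes = msgpack.unpackb(hashes_raw)
offsets = msgpack.unpackb(offsets_raw)
configs = msgpack.unpackb(cfg_raw)
print(lookup_id, hashes, offsets, configs)
```

In the real flow, the worker replies on a separate socket with `[lookup_id, num_hit_tokens]`, as steps 8 and 10 in the diagram above show.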
Architecture (Worker Side)#
```mermaid
flowchart LR
    subgraph Worker
        direction TB
        A["LMCacheAsyncLookupServer<br/>ZMQ PULL/PUSH"]
        B["StorageManager<br/>Async loop (thread)"]
        C["AsyncSerializer<br/>WeightedSemaphore"]
        D["EventManager<br/>EventType.LOADING"]
    end
    subgraph Backends
        E["LocalCPUBackend<br/>contains/get"]
        F["LocalDiskBackend<br/>async contains/get"]
        G["FSConnector<br/>remote FS"]
    end
    A --> B
    B --> C
    B --> D
    B -.contains/get.-> E
    B -.contains/get.-> F
    B -.contains/get.-> G
    style E fill:#dff,stroke:#333,stroke-width:1px
    style F fill:#ffd,stroke:#333,stroke-width:1px
    style G fill:#dfd,stroke:#333,stroke-width:1px
```
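The architecture above can be condensed into a short sketch of the worker-side lookup-and-prefetch loop. Backend method names follow the diagrams; the loop body, signatures, and chunk-based accounting are illustrative assumptions rather than the actual StorageManager code:

```python
import asyncio
from typing import Callable, List, Sequence


async def async_lookup_and_prefetch(
    backends: Sequence,            # ordered tiers: CPU, disk, remote, ...
    lookup_id: str,
    keys: List[bytes],
    send_response: Callable[[str, int], None],
    event_manager,
) -> None:
    hit = 0
    futures = []
    remaining = keys
    for backend in backends:
        if not remaining:
            break
        # Ask this tier how many chunks of the remaining prefix it holds,
        # pinning them so they cannot be evicted before the get completes.
        num_hit = await backend.batched_async_contains(
            lookup_id, remaining, pin=True
        )
        if num_hit == 0:
            continue
        # Eagerly launch a non-blocking batched get for the hit prefix.
        futures.append(
            backend.batched_get_non_blocking(lookup_id, remaining[:num_hit])
        )
        hit += num_hit
        remaining = remaining[num_hit:]

    if futures:
        # One LOADING event resolves when every launched get finishes.
        event_manager.add_event("LOADING", lookup_id, asyncio.gather(*futures))
    # Reply immediately with the hit length (counted in chunks here; the
    # real server reports tokens); loading continues in the background
    # and is joined later through the event manager.
    send_response(lookup_id, hit)
```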
Benefits#
- Performance overlap
  - I/O–Compute Overlap: decoupling lookup/prefetch from loading enables fetching KV chunks while vLLM continues scheduling/computation.
- Robustness and error handling
  - Event-driven Synchronization: `EventManager` ensures safe hand-off of futures and avoids race conditions between threads and the async loop (see the sketch after this list).
  - Backpressure & Deadlock Avoidance: `AsyncSerializer` with a weighted semaphore caps concurrent chunk retrievals based on the allocator budget, preventing starvation or allocator lockups.
  - Graceful Miss Path: an immediate response with `None` hit tokens when nothing is retrievable; the worker returns quickly without stalling the scheduler.
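Here is a minimal sketch of the event-driven hand-off described above, assuming an asyncio loop running in a dedicated worker thread; this `EventManager` is a simplified stand-in for the LMCache class of the same name:

```python
import asyncio
import threading
from concurrent.futures import Future
from typing import Dict, Tuple


class EventManager:
    """Thread-safe registry of in-flight events keyed by (type, id)."""

    def __init__(self) -> None:
        self._events: Dict[Tuple[str, str], Future] = {}
        self._lock = threading.Lock()

    def add_event(self, event_type: str, event_id: str, fut: Future) -> None:
        with self._lock:
            self._events[(event_type, event_id)] = fut

    def pop_event(self, event_type: str, event_id: str) -> Future:
        with self._lock:
            return self._events.pop((event_type, event_id))


# Run the async loop in a dedicated worker thread, mirroring the
# "Async loop (thread)" box in the architecture diagram.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()
manager = EventManager()


async def load_chunks() -> list:
    await asyncio.sleep(0.01)  # stand-in for batched backend I/O
    return ["obj0", "obj1"]


# Async-loop side: register the in-flight load under its lookup id.
fut = asyncio.run_coroutine_threadsafe(load_chunks(), loop)
manager.add_event("LOADING", "req-42", fut)

# Retrieval path (another thread): join the event only when the loaded
# memory objects are actually needed.
memory_objs = manager.pop_event("LOADING", "req-42").result()
print(memory_objs)  # ['obj0', 'obj1']
```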
Comparison with vLLM Load Failure Recovery feature#
vLLM PR 19330 introduces a fault-recovery mechanism for vLLM's KV connector infrastructure: failed KV cache block loads are detected automatically, and only the affected requests are rescheduled for recomputation from a valid prefix.
By contrast, LMCache's `async_loading` is an externalized caching layer with its own lookup client/server, storage backends, and concurrency control.
Limitations#
- Works only with vLLM builds that include vLLM PR 23620.
- Backend support constraint: this feature currently requires backends that implement `batched_async_contains`, which limits it to a few backends:
  - LocalCPUBackend
  - LocalDiskBackend
  - S3Connector
  - FSConnector
  - RedisConnector/RedisClusterConnector
- `save_unfull_chunk` constraint: saving partial (unfull) chunks is automatically disabled in async mode for correctness in prefix chunking.
Future Work#
- Introduce a default `batched_async_contains` implementation, so all backends can support `async_loading` (a hypothetical sketch follows this list).
- Refactor `AsyncSerializer` to support being enabled together with `save_unfull_chunk` and `PDBackend`.
- Add metrics and observability to track the number of asynchronous lookup requests and the number of occupied `MemoryObj` instances.
- Improve the lookup framework by passing vLLM prefix-cache hit tokens so that async lookup can skip loading parts already hit in vLLM.
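For the first item, a default implementation could plausibly fall back to a backend's existing synchronous `contains`. The sketch below is hypothetical, including the assumed `contains(key, pin)` signature, and only illustrates one way the fallback might look:

```python
import asyncio
from typing import List


async def default_batched_async_contains(
    backend, lookup_id: str, keys: List[bytes], pin: bool = False
) -> int:
    """Count the longest hit prefix by probing keys one at a time.

    Runs each (potentially blocking) contains() call in the default
    executor so the async loop stays responsive.
    """
    loop = asyncio.get_running_loop()
    hit = 0
    for key in keys:
        found = await loop.run_in_executor(None, backend.contains, key, pin)
        if not found:
            break  # prefix semantics: stop at the first miss
        hit += 1
    return hit
```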