Architecture & Developer Guide#
This page describes the internal architecture of LMCache multiprocess mode. It is aimed at developers who want to understand, debug, or extend the system.
High-Level Architecture#
vLLM Instance(s)
      |
      |  ZMQ (tcp)
      v
MessageQueueServer (mq.py)
      |
      |  dispatch by RequestType
      v
MPCacheEngine (server.py)
      |
      |--- TokenHasher / SessionManager
      |
      v
StorageManager (distributed/storage_manager.py)
      |
      |--- L1Manager (l1_manager.py)
      |      |--- L1MemoryManager (memory allocator)
      |      |--- TTLLock per object (read/write)
      |
      |--- StoreController ------> L2 Adapter(s)   (async L1->L2 push)
      |--- PrefetchController ---> L2 Adapter(s)   (async L2->L1 load)
      |--- EvictionController ---> L1Manager       (watermark-triggered eviction)
      |
      v
EventBus + OTel providers (observability)
Server Variants#
All three server entry points share the same MPCacheEngine and
StorageManager core.
``server.py`` – The default ZMQ-only server. Creates an MPCacheEngine
and a MessageQueueServer, registers handlers for all core
RequestType values, and blocks in a keep-alive loop.
``blend_server_v2.py`` – Extends MPCacheEngine with BlendEngineV2,
which adds CacheBlend operations (CB_REGISTER_KV_CACHE,
CB_LOOKUP_PRE_COMPUTED, CB_STORE_PRE_COMPUTED,
CB_RETRIEVE_PRE_COMPUTED, CB_STORE_FINAL). Enables non-prefix KV
cache reuse across document paragraphs.
``http_server.py`` – Wraps run_cache_server() (from server.py)
inside a FastAPI application. Endpoints are contributed by modules under
http_apis/ and auto-registered via HTTPAPIRegistry: GET / (basic
liveness), GET /healthcheck for Kubernetes probes, POST /clear-cache
for clearing all KV cache data in L1 (CPU) memory, and GET /status
for inspecting detailed internal state. The ZMQ server runs as part of the
same process, and any configured runtime plugins are spawned by
MPRuntimePluginLauncher during FastAPI startup.
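The HTTP endpoints can be exercised with any HTTP client. The sketch below is illustrative only: it assumes the FastAPI frontend listens on localhost:8000, which depends entirely on how the server was launched:

import requests  # any HTTP client works; requests is used here for brevity

BASE = "http://localhost:8000"  # assumption: replace with your frontend's host and port

print(requests.get(f"{BASE}/healthcheck").status_code)   # Kubernetes-style liveness probe
print(requests.get(f"{BASE}/status").json())              # detailed internal state
requests.post(f"{BASE}/clear-cache")                       # drop all KV cache data held in L1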
ZMQ Protocol#
Communication between vLLM and LMCache uses ZMQ (DEALER/ROUTER pattern).
RequestType enum (defined in protocols/base.py):
| Request Type | Handler Type | Description |
|---|---|---|
| … | SYNC | Register GPU KV cache tensors for a vLLM instance. |
| … | SYNC | Unregister KV cache tensors. |
| STORE | BLOCKING | Store KV cache chunks from GPU to L1 (CPU). |
| RETRIEVE | BLOCKING | Copy KV cache chunks from L1 (CPU) back to GPU. |
| LOOKUP | BLOCKING | Submit a prefix lookup; the prefetch job is tracked server-side by request_id. |
| … | BLOCKING | Poll a prefetch job by request_id. Returns the loaded chunk count when done, or … otherwise. |
| … | BLOCKING | Query the lookup-phase hit chunk count by request_id, before the prefetch finishes. Returns … |
| … | BLOCKING | Release read locks from a cancelled lookup without doing a full RETRIEVE. |
| … | BLOCKING | Remove session state for a finished request. |
| … | BLOCKING | Clear all cached data. |
| … | SYNC | Return the server's chunk size. |
| … | BLOCKING | Liveness ping; the handler always returns … |
| … | BLOCKING | Fire-and-forget channel for the vLLM scheduler to report GPU block allocation events to the observability subsystem. |
| … | SYNC | Debug heartbeat – returns a confirmation string. |
| CB_REGISTER_KV_CACHE | SYNC | (Blend) Register CacheBlend KV buffer. |
| … | SYNC | (Blend) Unregister CacheBlend KV buffer. |
| CB_STORE_PRE_COMPUTED | BLOCKING | (Blend) Store pre-computed paragraph chunks. |
| CB_LOOKUP_PRE_COMPUTED | BLOCKING | (Blend) Lookup pre-computed paragraph chunks. |
| CB_RETRIEVE_PRE_COMPUTED | BLOCKING | (Blend) Retrieve pre-computed paragraph chunks to GPU. |
| CB_STORE_FINAL | BLOCKING | (Blend) Store final blended chunks. |
| … | BLOCKING | (Blend V2) Lookup pre-computed chunks; returns … |
| … | BLOCKING | (Blend V2) Retrieve pre-computed chunks using the … |
Handler types:
SYNC – Runs directly in the ZMQ main loop (fast, non-blocking).
BLOCKING – Dispatched to a thread pool (may involve GPU copies or I/O).
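As a rough mental model (this is an illustration, not the actual mq.py code; the handlers table and executor below are invented for the sketch), the dispatch boils down to:

from concurrent.futures import ThreadPoolExecutor

# Illustrative only: SYNC handlers run inline on the ZMQ loop, BLOCKING
# handlers go to a worker pool so GPU copies and I/O never stall the loop.
executor = ThreadPoolExecutor(max_workers=8)
handlers = {}  # RequestType -> (handler_type, callable), filled at startup

def dispatch(request_type, payload, reply):
    handler_type, fn = handlers[request_type]
    if handler_type == "SYNC":
        reply(fn(payload))                            # fast path on the ZMQ thread
    else:  # "BLOCKING"
        executor.submit(lambda: reply(fn(payload)))   # may block on GPU copies or I/O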
Config System#
Each config module exposes a composable triple:
(DataclassConfig, add_*_args(parser), parse_args_to_*_config(args))
server.py:parse_args() composes them:
parser = argparse.ArgumentParser(...)
add_mp_server_args(parser)          # from multiprocess/config.py
                                    # includes runtime-plugin args
                                    # (--runtime-plugin-locations,
                                    #  --runtime-plugin-config)
add_storage_manager_args(parser)    # from distributed/config.py
                                    # which internally calls add_l2_adapters_args(parser)
add_observability_args(parser)      # from mp_observability/config.py
Both blend_server_v2.py and http_server.py reuse this pattern, adding
add_http_frontend_args() for the HTTP variant.
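A new config module following this convention could look like the sketch below (the dataclass fields and flag names are placeholders invented for illustration, not real LMCache options):

import argparse
from dataclasses import dataclass

@dataclass
class MyFeatureConfig:
    enabled: bool = False
    budget_gb: float = 1.0

def add_my_feature_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--my-feature-enabled", action="store_true")
    parser.add_argument("--my-feature-budget-gb", type=float, default=1.0)

def parse_args_to_my_feature_config(args: argparse.Namespace) -> MyFeatureConfig:
    return MyFeatureConfig(enabled=args.my_feature_enabled,
                           budget_gb=args.my_feature_budget_gb)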
Distributed Storage#
StorageManager#
lmcache/v1/distributed/storage_manager.py
The top-level manager that wires together L1, L2, and all controllers. Key methods:
reserve_write() / finish_write() – Two-phase write into L1.
submit_prefetch_task() / query_prefetch_status() – Async lookup + L2 prefetch.
read_prefetched_results() / finish_read_prefetched() – Read prefetched data from L1 with automatic lock management.
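Put together, a caller drives the write and prefetch paths roughly as follows (a sketch only; argument names and return shapes are assumptions, not the real signatures, and the copy helpers are hypothetical):

# Two-phase write
mem_objs = storage_manager.reserve_write(keys)       # allocate + write-lock L1 slots
copy_kv_from_gpu(mem_objs)                           # hypothetical GPU->CPU copy helper
storage_manager.finish_write(keys)                   # mark ready; StoreController may now push to L2

# Async lookup + prefetch
storage_manager.submit_prefetch_task(request_id, keys)
if storage_manager.query_prefetch_status(request_id).done:
    objs = storage_manager.read_prefetched_results(request_id)   # takes read locks
    copy_kv_to_gpu(objs)                                          # hypothetical CPU->GPU copy helper
    storage_manager.finish_read_prefetched(request_id)           # releases read locks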
L1Manager#
lmcache/v1/distributed/l1_manager.py
Manages objects in CPU memory with a state machine:
None --reserve_write--> write_locked --finish_write--> ready --reserve_read--> read_locked
                                                          |                         |
                                                          v                         v
                                                      evictable           finish_read -> ready
Each object has two TTLLock instances (read and write) with configurable
timeouts to prevent deadlocks from crashed clients.
The L1MemoryManager handles the underlying memory allocation (lazy growth
up to --l1-size-gb).
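The idea behind the TTL locks can be pictured with a minimal sketch (this is an illustration of the concept of a self-expiring lease, not the actual TTLLock implementation):

import threading, time

class TTLLock:
    """Lease-style lock: if the holder never releases (e.g. a crashed
    client), the lock becomes acquirable again after ttl seconds."""
    def __init__(self, ttl: float):
        self._ttl = ttl
        self._mutex = threading.Lock()
        self._expires_at = 0.0

    def acquire(self) -> bool:
        with self._mutex:
            now = time.monotonic()
            if now < self._expires_at:
                return False              # still held, lease not yet expired
            self._expires_at = now + self._ttl
            return True

    def release(self) -> None:
        with self._mutex:
            self._expires_at = 0.0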
L2 Adapters#
lmcache/v1/distributed/l2_adapters/
The L2AdapterInterface (in base.py) defines three async task methods:
submit_store_task(key, data) – Push data to L2.
submit_lookup_and_lock_task(keys) – Check if keys exist in L2.
submit_load_task(keys, layout_desc) – Load data from L2 into L1.
The factory function create_l2_adapter() (in __init__.py) uses
isinstance() on the config type to instantiate the correct adapter.
New adapter types are registered via register_l2_adapter_type() in
config.py.
Controllers#
StoreController (storage_controllers/store_controller.py):
Event-driven background thread that uses select.poll() on a listener eventfd
and adapter store eventfds. When new objects appear in L1 (signaled via
StoreListener), it submits async store tasks to each L2 adapter based on
the StorePolicy.
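The wake-up mechanism is the standard eventfd/poll pattern, roughly as in this Linux-only sketch (Python 3.10+ for os.eventfd; the real controller wiring has more moving parts):

import os, select

efd = os.eventfd(0)                 # counter fd signalled by producers
poller = select.poll()
poller.register(efd, select.POLLIN)

# Producer side (e.g. StoreListener): signal "new objects landed in L1".
os.eventfd_write(efd, 1)

# Controller side: sleep until any registered fd fires, then drain and act.
for fd, _events in poller.poll():
    os.eventfd_read(fd)             # reset the counter
    # ... submit async store tasks to each L2 adapter per the StorePolicy ...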
EvictionController (storage_controllers/eviction_controller.py):
Periodically checks L1 memory usage against the watermark threshold. When
triggered, evicts objects using the configured policy (LRU,
IsolatedLRU, or noop) until usage drops below the target.
IsolatedLRU evicts per cache_salt against limits registered through
the /api/quota HTTP endpoints; see the "/api/quota — per-cache_salt quota management" section.
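In pseudocode the watermark loop amounts to something like the following (illustrative only; usage_fraction() and evict_one() are invented names, and the thresholds shown are assumptions):

import time

def eviction_loop(l1, high_watermark=0.9, low_watermark=0.8, interval_s=1.0):
    while True:
        if l1.usage_fraction() >= high_watermark:
            # Evict ready (unlocked) objects per the configured policy
            # until usage drops back below the target.
            while l1.usage_fraction() > low_watermark:
                if not l1.evict_one():
                    break               # nothing evictable right now
        time.sleep(interval_s)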
PrefetchController (storage_controllers/prefetch_controller.py):
Handles L2 lookup and load requests submitted by StorageManager during
LOOKUP RPCs. When keys are not in L1, it queries L2 adapters and loads
found data back into L1.
Request Flows#
LOOKUP Flow#
vLLM               MPCacheEngine           StorageManager         L1Manager    L2 (PrefetchController)
  |                        |                        |                   |                   |
  |---LOOKUP(key)--------->|                        |                   |                   |
  |                        |--submit_prefetch------>|                   |                   |
  |                        |                        |--reserve_read---->|                   |
  |                        |                        |<--hit_count-------|                   |
  |                        |                        |--submit_prefetch_request------------->|
  |                        |                        |  (remaining keys)                     |
  |                        |--query_prefetch------->|                   |                   |
  |                        |                        |--query_prefetch_result--------------->|
  |                        |<--found_count----------|                   |                   |
  |<--found_count----------|                        |                   |                   |
STORE Flow#
vLLM               MPCacheEngine           StorageManager         L1Manager
  |                        |                        |                   |
  |---STORE(key,blocks)--->|                        |                   |
  |                        |--reserve_write-------->|                   |
  |                        |                        |--reserve_write--->|
  |                        |                        |<--memory_objs-----|
  |                        |   (GPU->CPU copy)      |                   |
  |                        |--finish_write--------->|                   |
  |                        |                        |--finish_write---->|
  |                        |                        |                   |
  |                        |                        |   [StoreController detects new objects]
  |                        |                        |   [async L1->L2 push via adapters]
  |<--event_handle---------|                        |                   |
RETRIEVE Flow#
vLLM               MPCacheEngine           StorageManager         L1Manager
  |                        |                        |                   |
  |---RETRIEVE(key)------->|                        |                   |
  |                        |--read_prefetched------>|                   |
  |                        |                        |--unsafe_read----->|
  |                        |                        |<--memory_objs-----|
  |                        |   (CPU->GPU copy)      |                   |
  |                        |--finish_read_prefetch->|                   |
  |                        |                        |--finish_read----->|
  |<--event_handle---------|                        |                   |
Observability Internals#
EventBus (lmcache/v1/mp_observability/event_bus.py) is a global
singleton initialized at server startup by init_observability().
Producers (L1Manager, StorageManager, MPCacheEngine) publish Event
objects to a bounded queue (--event-bus-queue-size, default 10000,
tail-drop on overflow). A background drain thread dispatches each
event to all registered subscribers.
Subscribers live under lmcache/v1/mp_observability/subscribers/
and are grouped by concern: metrics/ (OTel counters and lifecycle
histograms), logging/ (Python logging handlers, lookup-hash JSONL),
and tracing/ (OTel spans built from START/END event pairs).
init_observability() registers the set selected by CLI flags
(--disable-metrics, --disable-logging, --enable-tracing).
OTel providers are set up via otel_init.py before subscribers
are constructed, so module-level get_meter() / get_tracer()
calls bind to the real provider. Metrics are exported both to an
in-process Prometheus /metrics endpoint (--prometheus-port,
default 9090) and, when --otlp-endpoint is set, pushed to an OTel
collector.
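To sanity-check that metrics are flowing, the Prometheus endpoint can be scraped directly (9090 is the documented default for --prometheus-port; adjust if configured differently):

from urllib.request import urlopen

with urlopen("http://localhost:9090/metrics") as resp:
    body = resp.read().decode()
# Print the first few non-comment sample lines.
print("\n".join(l for l in body.splitlines() if l and not l.startswith("#"))[:800])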
How to Extend#
Adding a new L2 adapter#
1. Create a new *_l2_adapter.py module under lmcache/v1/distributed/l2_adapters/. The package's __init__.py auto-discovers modules matching that suffix via pkgutil and imports them lazily on first use, so no other files need to be modified.
2. Create a config class subclassing L2AdapterConfigBase with from_dict() and help() methods.
3. Create an adapter class implementing L2AdapterInterface, plus a small factory function (config, l1_memory_desc) -> L2AdapterInterface.
4. At module level, self-register both the config and the factory:

register_l2_adapter_type("my_adapter", MyAdapterConfig)
register_l2_adapter_factory("my_adapter", _create_my_adapter)
See mock_l2_adapter.py or s3_l2_adapter.py for reference
implementations.
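A hedged skeleton of such a module is sketched below. The interface method and registration function names come from this page, but the config fields, import paths, exact base-class contracts, and factory signature are assumptions; treat it as a shape, not a drop-in implementation:

# my_l2_adapter.py (illustrative skeleton)
from dataclasses import dataclass

from .base import L2AdapterInterface
from .config import (L2AdapterConfigBase, register_l2_adapter_type,
                     register_l2_adapter_factory)

@dataclass
class MyAdapterConfig(L2AdapterConfigBase):
    endpoint: str = ""

    @classmethod
    def from_dict(cls, d):
        return cls(endpoint=d.get("endpoint", ""))

    @staticmethod
    def help() -> str:
        return "My example L2 backend (endpoint=<url>)"

class MyL2Adapter(L2AdapterInterface):
    def __init__(self, config, l1_memory_desc):
        self._config = config
        self._l1_memory_desc = l1_memory_desc

    def submit_store_task(self, key, data):
        ...  # push data to the remote backend

    def submit_lookup_and_lock_task(self, keys):
        ...  # report which keys exist in L2 (and lock them)

    def submit_load_task(self, keys, layout_desc):
        ...  # load data from L2 into L1 buffers

def _create_my_adapter(config, l1_memory_desc) -> L2AdapterInterface:
    return MyL2Adapter(config, l1_memory_desc)

# Self-registration at import time; the module itself is auto-discovered.
register_l2_adapter_type("my_adapter", MyAdapterConfig)
register_l2_adapter_factory("my_adapter", _create_my_adapter)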
Adding an observability subscriber#
1. Create a subscriber class subclassing EventSubscriber (defined in lmcache/v1/mp_observability/event_bus.py): implement get_subscriptions() to return an {EventType: callback} mapping; optionally override shutdown() for cleanup.
2. Place the class under the appropriate concern group (subscribers/metrics/, subscribers/logging/, or subscribers/tracing/) and export it from that package's __init__.py.
3. Register the subscriber in init_observability() (lmcache/v1/mp_observability/config.py) via bus.register_subscriber(...), inside the branch matching its concern (metrics / logging / tracing), gated on the corresponding CLI flag if needed.
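A minimal subscriber could look like this sketch (the EventType member name and its import path are assumptions made for the example):

from lmcache.v1.mp_observability.event_bus import EventSubscriber, EventType  # EventType location assumed

class StoreCounterSubscriber(EventSubscriber):
    """Toy example: counts how many store events it sees."""
    def __init__(self):
        self.count = 0

    def get_subscriptions(self):
        # EventType -> callback mapping consumed by the EventBus drain thread
        return {EventType.STORE_FINISHED: self._on_store}   # member name is an assumption

    def _on_store(self, event):
        self.count += 1

    def shutdown(self):
        print(f"observed {self.count} store events")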
Adding a new request type#
1. Add a new member to RequestType in protocols/base.py.
2. Create a ProtocolDefinition in the appropriate protocols/*.py file (engine, controller, observability, debug, blend, or blend_v2) and add the request name to that module's REQUEST_NAMES.
3. Implement the handler method on MPCacheEngine (or BlendEngineV2).
4. Register the handler in run_cache_server() via add_handler_helper().
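Schematically the steps fit together as below. This is a toy mirror of the flow, not LMCache code: only RequestType and add_handler_helper() are real names from this page, and their actual definitions and signatures will differ (step 2 is omitted here because ProtocolDefinition's fields are not covered on this page):

from enum import Enum, auto

class RequestType(Enum):             # step 1: new enum member (toy enum here)
    MY_NEW_REQUEST = auto()

def handle_my_new_request(payload):  # step 3: handler logic (lives on MPCacheEngine in practice)
    return {"echo": payload}

handlers = {}
def add_handler_helper(request_type, handler_type, fn):   # signature is an assumption
    handlers[request_type] = (handler_type, fn)

# step 4: registration, done inside run_cache_server() in the real server
add_handler_helper(RequestType.MY_NEW_REQUEST, "BLOCKING", handle_my_new_request)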
Key Source Files#
| File | Purpose |
|---|---|
| server.py | MPCacheEngine + ZMQ server entry point |
| multiprocess/config.py | MPServerConfig, HTTPFrontendConfig |
| blend_server_v2.py | BlendEngineV2 (extends MPCacheEngine) |
| http_server.py | FastAPI wrapper (health check, clear-cache, status, and other HTTP APIs) |
| http_apis/ | Extensible HTTP endpoints (auto-registered via HTTPAPIRegistry) |
| protocols/base.py | RequestType, HandlerType, ProtocolDefinition |
| distributed/storage_manager.py | StorageManager (top-level manager) |
| distributed/config.py | StorageManagerConfig hierarchy |
| distributed/l1_manager.py | L1Manager (object state machine) |
| distributed/l2_adapters/config.py | L2 adapter config registry |
| distributed/l2_adapters/base.py | L2AdapterInterface |
| storage_controllers/store_controller.py | StoreController (event-driven L1->L2) |
| storage_controllers/eviction_controller.py | EvictionController (watermark-triggered) |
| storage_controllers/prefetch_controller.py | PrefetchController (L2->L1 on miss) |
| mp_observability/config.py | ObservabilityConfig + init_observability() |
| mp_observability/event_bus.py | EventBus singleton and EventSubscriber base class |
| otel_init.py | OTel metrics / tracing provider setup |
| mp_observability/subscribers/ | Metrics, logging, and tracing subscribers |
| … | Trace recording (…) |