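Architecture and Developer Guide

This page describes the internal architecture of LMCache's multiprocess mode. It is intended for developers who want to understand, debug, or extend the system.

High-Level Architecture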

vLLM Instance(s)
     |
     | ZMQ (tcp)
     v
MessageQueueServer (mq.py)
     |
     | dispatch by RequestType
     v
MPCacheServer (server.py)
     |
     |--- TokenHasher / SessionManager
     |
     v
StorageManager (distributed/storage_manager.py)
     |
     |--- L1Manager (l1_manager.py)
     |       |--- L1MemoryManager (CPU DRAM) or
     |       |    GDSL1MemoryManager (NVMe slab via cuFile)
     |       |--- TTLLock per object (read/write)
     |
     |--- StoreController  -----> L2 Adapter(s) (async L1->L2 push)
     |--- PrefetchController ---> L2 Adapter(s) (async L2->L1 load)
     |--- EvictionController ----> L1Manager (watermark-triggered eviction)
     |
     v
EventBus + OTel providers (observability)

All server entry points share the same MPCacheServer and StorageManager core. MPCacheServer is now a thin composer: it holds an MPCacheServerContext and a list of EngineModule instances assembled by build_engine_modules() (in server.py) based on --engine-type and --supported-transfer-mode.

``server.py`` -- The default ZMQ-only server. Creates an MPCacheServer and assembles the engine modules: LookupModule and ManagementModule, plus GPUTransferModule and/or NonGPUTransferModule depending on --supported-transfer-mode (gpu or non_gpu loads just one; auto, the default, loads both). When --engine-type is set, a CacheBlend module is appended as well: blend appends BlendV3Module (the current paged-aware implementation), and blend_legacy appends BlendModule (the original). The server then starts a MessageQueueServer, registers handlers for every RequestType exposed by the loaded modules, and blocks in a keep-alive loop.

``modules/blend.py`` -- Defines BlendModule and BlendEngineV2, which add the original CacheBlend operations (CB_REGISTER_KV_CACHE, CB_LOOKUP_PRE_COMPUTED, CB_STORE_PRE_COMPUTED, CB_RETRIEVE_PRE_COMPUTED, CB_STORE_FINAL and their V2 variants). Enables non-prefix KV cache reuse across document paragraphs. Selected by passing --engine-type blend_legacy to lmcache server.

``modules/blend_v3.py`` -- Defines BlendV3Module, the paged-aware CacheBlend V3 pipeline that runs on the sparse-prefetch path. Adds the V3 RPCs (CB_REGISTER_ROPE_V3, CB_UNREGISTER_ROPE_V3, CB_RETRIEVE_PRE_COMPUTED_V3, CB_UNIFIED_LOOKUP) and reuses the existing GPUTransferModule and LookupModule. Selected by passing --engine-type blend to lmcache server.

Both blend variants require --supported-transfer-mode to be gpu or auto and will refuse to load when it is non_gpu.

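``http_server.py`` -- Wraps run_cache_server() (from server.py) in a FastAPI application. Endpoints are contributed by modules under http_apis/ and registered automatically through HTTPAPIRegistry: GET / (basic liveness check), GET /healthcheck for Kubernetes probes, POST /clear-cache to clear all KV cache data from L1 (CPU) memory, and GET /status to inspect detailed internal state. The ZMQ server runs as part of the same process, and any configured runtime plugins are spawned by MPRuntimePluginLauncher during FastAPI startup.

ZMQ Protocol

vLLM and LMCache communicate over ZMQ (DEALER/ROUTER pattern).

**RequestType enum** (defined in protocols/base.py):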

Request type	Handler type	Description
`REGISTER_KV_CACHE`	Sync	Register the GPU KV cache tensors for a vLLM instance.
`UNREGISTER_KV_CACHE`	Sync	Unregister the KV cache tensors.
`REGISTER_KV_CACHE_NON_GPU_CONTEXT`	Sync	Register a non-GPU KV cache context (CPU/accelerator workers using the PREPARE/COMMIT transfer path). Loaded only when `--supported-transfer-mode` is `non_gpu` or `auto`. Returns a `RegisterNonGpuContextResponse` carrying the SHM segment name and pool size when the SHM path is in use (empty for the pickle path).
`UNREGISTER_KV_CACHE_NON_GPU_CONTEXT`	Sync	Unregister a non-GPU KV cache context.
`STORE`	Blocking	Store KV cache chunks from GPU to L1 (CPU). GPU transfer path (CUDA IPC); loaded only when `--supported-transfer-mode` is `gpu` or `auto`.
`RETRIEVE`	Blocking	Copy KV cache chunks from L1 (CPU) back to GPU. GPU transfer path (CUDA IPC); loaded only when `--supported-transfer-mode` is `gpu` or `auto`.
`PREPARE_STORE`	Blocking	(Non-GPU path) Worker asks the server to prepare store-side transfer state for a key. Loaded when `--supported-transfer-mode` is `non_gpu` or `auto`.
`COMMIT_STORE`	Blocking	(Non-GPU path) Worker commits the chunk's serialized bytes (pickle path) or releases the prepared SHM slot (SHM path) so the server can persist into L1 storage.
`PREPARE_RETRIEVE`	Blocking	(Non-GPU path) Worker asks the server to prepare the retrieval payload for a key. The pickle path returns the bytes inline; the SHM path returns slot info so the worker can read from shared memory.
`COMMIT_RETRIEVE`	Blocking	(Non-GPU path) Worker acknowledges retrieval completion so the server can release the underlying read locks and reclaim any transport state.
`LOOKUP`	Blocking	Submit a prefix lookup; the prefetch job is tracked server-side by request_id.
`QUERY_PREFETCH_STATUS`	Blocking	Poll a prefetch job by request_id. Returns the number of loaded chunks when complete, or `None` while the prefetch is still in progress.
`QUERY_PREFETCH_LOOKUP_HITS`	Blocking	Query the lookup-phase hit chunk count by request_id before the prefetch completes. Returns `None` while the lookup is still running.
`FREE_LOOKUP_LOCKS`	Blocking	Release the read locks held by a cancelled lookup without performing a full RETRIEVE.
`END_SESSION`	Blocking	Remove the session state of a finished request.
`CLEAR`	Blocking	Clear all cached data.
`GET_CHUNK_SIZE`	Sync	Return the server's chunk size.
`PING`	Blocking	Liveness probe; the handler always returns `True`.
`REPORT_BLOCK_ALLOCATION`	Blocking	Fire-and-forget channel used by the vLLM scheduler to report GPU block allocation events to the observability subsystem.
`NOOP`	Sync	Debug heartbeat -- returns an acknowledgement string.
`CB_REGISTER_KV_CACHE`	Sync	(Blend) Register a CacheBlend KV buffer.
`CB_UNREGISTER_KV_CACHE`	Sync	(Blend) Unregister a CacheBlend KV buffer.
`CB_STORE_PRE_COMPUTED`	Blocking	(Blend) Store pre-computed paragraph chunks.
`CB_LOOKUP_PRE_COMPUTED`	Blocking	(Blend) Look up pre-computed paragraph chunks.
`CB_RETRIEVE_PRE_COMPUTED`	Blocking	(Blend) Retrieve pre-computed paragraph chunks to the GPU.
`CB_STORE_FINAL`	Blocking	(Blend) Store the final blended chunks.
`CB_LOOKUP_PRE_COMPUTED_V2`	Blocking	(Blend V2) Look up pre-computed chunks; returns `CBMatchResult` entries (with old/current ranges and per-chunk hashes) so the retrieve step can skip re-hashing.
`CB_RETRIEVE_PRE_COMPUTED_V2`	Blocking	(Blend V2) Retrieve pre-computed chunks using the `CBMatchResult` list returned by `CB_LOOKUP_PRE_COMPUTED_V2`.
`CB_REGISTER_ROPE_V3`	Sync	(Blend V3) Share the RoPE cos/sin cache onto a context already registered via `REGISTER_KV_CACHE`.
`CB_UNREGISTER_ROPE_V3`	Sync	(Blend V3) Drop the RoPE state (paged KV cache lives on; use `UNREGISTER_KV_CACHE` to release that).
`CB_RETRIEVE_PRE_COMPUTED_V3`	Blocking	(Blend V3) Scatter all matched chunks (prefix- and non-prefix-hit) into paged KV by per-token block ID; re-RoPE only the shifted subset.
`CB_UNIFIED_LOOKUP`	Blocking	(Blend V3) Sole live lookup path: one RPC runs prefix + non-prefix match, reconciles, issues one sparse-coalesced prefetch, and classifies per-TP-rank. Returns `CBUnifiedLookupResult` (or `None` while the prefetch is still in flight).

Handler types:

Sync -- runs directly in the ZMQ main loop (fast, non-blocking).
Blocking -- dispatched to a thread pool (may involve GPU copies or I/O).
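
The difference between the two handler types can be illustrated with a minimal sketch. The names below (HandlerType, Dispatcher, add_handler, dispatch) are hypothetical and are not LMCache's actual API. The sketch only assumes what the text states: one kind of handler runs inline on the loop thread, the other is handed to a thread pool. The handlers return the name of the thread they ran on purely to make that visible.

```python
import enum
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


class HandlerType(enum.Enum):
    SYNC = "sync"          # runs inline on the loop thread
    BLOCKING = "blocking"  # offloaded to a worker thread


class Dispatcher:
    """Hypothetical dispatcher; not the real MessageQueueServer."""

    def __init__(self, max_workers: int = 4) -> None:
        self._handlers: Dict[str, Tuple[HandlerType, Callable[..., Any]]] = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def add_handler(
        self, request_type: str, handler_type: HandlerType, fn: Callable[..., Any]
    ) -> None:
        self._handlers[request_type] = (handler_type, fn)

    def dispatch(self, request_type: str, *args: Any) -> Future:
        handler_type, fn = self._handlers[request_type]
        if handler_type is HandlerType.SYNC:
            # Run immediately on the caller's thread and wrap the outcome.
            future: Future = Future()
            try:
                future.set_result(fn(*args))
            except Exception as exc:
                future.set_exception(exc)
            return future
        # Blocking handlers must not stall the loop, so use the pool.
        return self._pool.submit(fn, *args)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def thread_name() -> str:
    return threading.current_thread().name


dispatcher = Dispatcher()
dispatcher.add_handler("NOOP", HandlerType.SYNC, thread_name)
dispatcher.add_handler("PING", HandlerType.BLOCKING, thread_name)

sync_thread = dispatcher.dispatch("NOOP").result()
blocking_thread = dispatcher.dispatch("PING").result()
dispatcher.shutdown()
```

Returning a Future in both cases keeps the caller's code path identical regardless of where the handler ran.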

Configuration System

Each configuration module exposes a composable triple:

(DataclassConfig, add_*_args(parser), parse_args_to_*_config(args))

server.py:parse_args() composes them:

parser = argparse.ArgumentParser(...)
add_mp_server_args(parser)        # from multiprocess/config.py
                                  # includes runtime-plugin args
                                  # (--runtime-plugin-locations,
                                  #  --runtime-plugin-config)
add_storage_manager_args(parser)  # from distributed/config.py
  # which internally calls add_l2_adapters_args(parser)
add_observability_args(parser)    # from mp_observability/config.py
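
As a rough illustration of the triple pattern, the sketch below defines two independent config modules and composes them on one parser. CacheConfig, LogConfig, and every flag name here are hypothetical and chosen only for the example; they are not LMCache options.

```python
import argparse
from dataclasses import dataclass


@dataclass
class CacheConfig:  # hypothetical config, not an LMCache class
    size_gb: float
    chunk_size: int


def add_cache_args(parser: argparse.ArgumentParser) -> None:
    group = parser.add_argument_group("cache")
    group.add_argument("--cache-size-gb", type=float, default=1.0)
    group.add_argument("--cache-chunk-size", type=int, default=256)


def parse_args_to_cache_config(args: argparse.Namespace) -> CacheConfig:
    return CacheConfig(size_gb=args.cache_size_gb, chunk_size=args.cache_chunk_size)


@dataclass
class LogConfig:  # hypothetical config, not an LMCache class
    level: str


def add_log_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--log-level", default="INFO")


def parse_args_to_log_config(args: argparse.Namespace) -> LogConfig:
    return LogConfig(level=args.log_level)


# Composition: each module adds its own flags, then reads back only those.
parser = argparse.ArgumentParser()
add_cache_args(parser)
add_log_args(parser)
args = parser.parse_args(["--cache-size-gb", "4", "--log-level", "DEBUG"])

cache_config = parse_args_to_cache_config(args)
log_config = parse_args_to_log_config(args)
```

Because each module only touches its own flags, an entry point can add or drop a module without changing the others.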

http_server.py reuses this pattern, adding add_http_frontend_args() and add_coordinator_args() for the lmcache server CLI. CacheBlend is no longer a separate entry point — it is opted into at runtime by passing --engine-type to server.py (or lmcache server). --engine-type blend appends BlendV3Module (the current paged-aware implementation), while --engine-type blend_legacy appends BlendModule (the original).

Distributed Storage

StorageManager

lmcache/v1/distributed/storage_manager.py

The top-level manager that ties together L1, L2, and all controllers. Key methods:

reserve_write() / finish_write() -- two-phase write into L1.
submit_prefetch_task() / query_prefetch_status() -- asynchronous lookup plus L2 prefetch.
read_prefetched_results() / finish_read_prefetched() -- read prefetched data from L1 with automatic lock management.

L1Manager

lmcache/v1/distributed/l1_manager.py

Manages objects in CPU memory using a state machine:

None --> write_locked --> ready --> read_locked
          (reserve_write)  (finish_write)  (reserve_read)
                              |                |
                              v                v
                           evictable      finish_read -> ready

Each object has two TTLLock instances (read and write) with configurable timeouts, which prevents deadlocks caused by client crashes.
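
The state machine and the role of the lock timeout can be sketched as follows. This is a simplified model under stated assumptions, not the real L1Manager: TTLLock here is a hypothetical counting lock whose holds all expire together, TinyL1Manager and CacheObject are invented names, and memory allocation and eviction are left out. A fake clock is injected so the expiry can be shown without sleeping.

```python
import time
from typing import Callable, Dict


class TTLLock:
    """Hypothetical counting lock whose holds expire after ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._count = 0
        self._expires_at = 0.0

    def acquire(self) -> None:
        if not self.is_locked():
            self._count = 0  # discard holds that already expired
        self._count += 1
        self._expires_at = self._clock() + self._ttl

    def release(self) -> None:
        if self.is_locked():
            self._count -= 1

    def is_locked(self) -> bool:
        return self._count > 0 and self._clock() < self._expires_at


class CacheObject:
    def __init__(self, ttl: float, clock: Callable[[], float]):
        self.write_lock = TTLLock(ttl, clock)
        self.read_lock = TTLLock(ttl, clock)
        self.written = False

    def state(self) -> str:
        if self.write_lock.is_locked():
            return "write_locked"
        if not self.written:
            return "none"  # never written, or the writer's lock expired
        return "read_locked" if self.read_lock.is_locked() else "ready"


class TinyL1Manager:
    def __init__(self, ttl: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self._ttl, self._clock = ttl, clock
        self._objects: Dict[str, CacheObject] = {}

    def state(self, key: str) -> str:
        obj = self._objects.get(key)
        return obj.state() if obj else "none"

    def reserve_write(self, key: str) -> bool:
        if self.state(key) != "none":
            return False
        obj = CacheObject(self._ttl, self._clock)
        obj.write_lock.acquire()
        self._objects[key] = obj
        return True

    def finish_write(self, key: str) -> None:
        obj = self._objects[key]
        obj.written = True
        obj.write_lock.release()

    def reserve_read(self, key: str) -> bool:
        if self.state(key) not in ("ready", "read_locked"):
            return False
        self._objects[key].read_lock.acquire()
        return True

    def finish_read(self, key: str) -> None:
        self._objects[key].read_lock.release()


now = [0.0]  # fake clock, advanced manually
manager = TinyL1Manager(ttl=10.0, clock=lambda: now[0])
manager.reserve_write("k")   # none -> write_locked
manager.finish_write("k")    # write_locked -> ready
manager.reserve_read("k")    # ready -> read_locked
state_while_reading = manager.state("k")
now[0] += 11.0               # the reader crashes and never calls finish_read
state_after_ttl = manager.state("k")
```

In this model an object whose reader disappeared returns to the ready state once the timeout passes, so it cannot stay pinned forever.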

The underlying memory allocation is handled by one of two interchangeable tiers selected at startup (both satisfy L1ManagerProtocol):

L1MemoryManager (default) -- pinned CPU DRAM, with lazy growth up to --l1-size-gb.
GDSL1MemoryManager -- an NVMe slab file when --gds-l1-path is set. The bytes live on disk; reads/writes DMA directly between the GPU staging buffer and the slab via cuFile, driven by the process-global GDSContext (gpu_connector/gds_context.py) and dispatched from gpu_ops. The CPU tier is disabled in this mode.

L2 Adapters

lmcache/v1/distributed/l2_adapters/

L2AdapterInterface (in base.py) defines three asynchronous task methods:

submit_store_task(key, data) -- push data to L2.
submit_lookup_and_lock_task(keys) -- check whether the keys exist in L2.
submit_load_task(keys, layout_desc) -- load data from L2 into L1.

The factory function create_l2_adapter() (in __init__.py) uses isinstance() checks on the config type to instantiate the correct adapter.

New adapter types are registered in config.py via register_l2_adapter_type().

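Controllers

StoreController (storage_controllers/store_controller.py): an event-driven background thread that uses select.poll() to monitor an event file descriptor and the adapters' store-event file descriptors. When new objects appear in L1 (signalled through StoreListener), it submits asynchronous store tasks to each L2 adapter according to the StorePolicy.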
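
A minimal sketch of such a poll-driven loop is shown below, under several assumptions: a pipe stands in for the event file descriptor, each "adapter" is a plain dict, every adapter receives every object (no StorePolicy), and the class and method names are hypothetical. select.poll() is only available on Unix-like systems.

```python
import os
import select
import threading
from collections import deque
from typing import Deque, Dict, List, Tuple


class TinyStoreController:
    """Hypothetical stand-in; a pipe plays the role of the event fd."""

    def __init__(self, adapters: List[Dict[str, bytes]]) -> None:
        self._adapters = adapters
        self._pending: Deque[Tuple[str, bytes]] = deque()
        self._lock = threading.Lock()
        self._stop = False
        self._read_fd, self._write_fd = os.pipe()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_new_object(self, key: str, data: bytes) -> None:
        # Called by the producer (the listener role): queue work, then wake the loop.
        with self._lock:
            self._pending.append((key, data))
        os.write(self._write_fd, b"\x01")

    def _run(self) -> None:
        poller = select.poll()
        poller.register(self._read_fd, select.POLLIN)
        while True:
            poller.poll()                 # sleep until something is signalled
            os.read(self._read_fd, 4096)  # drain the wake-up bytes
            with self._lock:
                batch = list(self._pending)
                self._pending.clear()
                stop = self._stop
            for key, data in batch:
                for adapter in self._adapters:
                    adapter[key] = data   # stand-in for an async store task
            if stop:
                return

    def close(self) -> None:
        with self._lock:
            self._stop = True
        os.write(self._write_fd, b"\x01")
        self._thread.join()
        os.close(self._read_fd)
        os.close(self._write_fd)


adapter_a: Dict[str, bytes] = {}
adapter_b: Dict[str, bytes] = {}
controller = TinyStoreController([adapter_a, adapter_b])
controller.on_new_object("chunk-0", b"kv-bytes")
controller.close()
```

The loop consumes no CPU while idle, which is the main reason to prefer a file-descriptor wake-up over periodic polling of a queue.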

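EvictionController (storage_controllers/eviction_controller.py): periodically checks L1 memory usage against a watermark threshold. When triggered, it evicts objects using the configured policy (LRU, IsolatedLRU, or noop) until usage falls below the target. IsolatedLRU evicts per cache_salt according to limits registered through the /quota HTTP endpoint; see /quota (per-cache_salt quota management).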
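
Watermark-triggered LRU eviction can be sketched in a few lines. The class, the parameter names (high_watermark, target_ratio), and the byte accounting below are assumptions made for illustration; the real controller's thresholds and policies are configured elsewhere and may behave differently.

```python
from collections import OrderedDict
from typing import List


class TinyLRUEvictor:
    """Hypothetical model: evict oldest objects once usage crosses a watermark."""

    def __init__(
        self, capacity_bytes: int, high_watermark: float = 0.8, target_ratio: float = 0.6
    ) -> None:
        self.capacity_bytes = capacity_bytes
        self.high_watermark = high_watermark
        self.target_ratio = target_ratio
        self.used_bytes = 0
        self._objects: "OrderedDict[str, int]" = OrderedDict()  # key -> size, oldest first

    def put(self, key: str, size: int) -> None:
        self.used_bytes -= self._objects.pop(key, 0)
        self._objects[key] = size
        self.used_bytes += size

    def touch(self, key: str) -> None:
        # A read makes the object the most recently used one.
        self._objects.move_to_end(key)

    def check_and_evict(self) -> List[str]:
        if self.used_bytes <= self.high_watermark * self.capacity_bytes:
            return []  # below the watermark: nothing to do
        target = self.target_ratio * self.capacity_bytes
        evicted: List[str] = []
        while self.used_bytes > target and self._objects:
            key, size = self._objects.popitem(last=False)  # least recently used
            self.used_bytes -= size
            evicted.append(key)
        return evicted


evictor = TinyLRUEvictor(capacity_bytes=100, high_watermark=0.8, target_ratio=0.5)
for name in ("a", "b", "c"):
    evictor.put(name, 30)      # usage reaches 90, above the 80-byte watermark
evictor.touch("a")             # "a" becomes the most recently used object
evicted_keys = evictor.check_and_evict()
```

Using a target below the trigger watermark means one eviction pass frees enough headroom that the next few writes do not immediately trigger another pass.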

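PrefetchController (storage_controllers/prefetch_controller.py): handles the L2 lookup and load requests that StorageManager submits during a LOOKUP RPC. When keys are not in L1, it queries the L2 adapters and loads any data found back into L1.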

Request Flows

Lookup Flow

vLLM                MPCacheServer          StorageManager         L1Manager       L2 (PrefetchController)
 |                       |                       |                    |                    |
 |---LOOKUP(key)-------->|                       |                    |                    |
 |                       |--submit_prefetch----->|                    |                    |
 |                       |                       |--reserve_read----->|                    |
 |                       |                       |<--hit_count--------|                    |
 |                       |                       |--submit_prefetch_request--------------->|
 |                       |                       |   (remaining keys) |                    |
 |                       |--query_prefetch------>|                    |                    |
 |                       |                       |--query_prefetch_result----------------->|
 |                       |<--found_count---------|                    |                    |
 |<--found_count---------|                       |                    |                    |

Store Flow

vLLM                MPCacheServer          StorageManager         L1Manager
 |                       |                       |                    |
 |---STORE(key,blocks)-->|                       |                    |
 |                       |--reserve_write-------->|                    |
 |                       |                       |--reserve_write---->|
 |                       |                       |<--memory_objs------|
 |                       |  (GPU->CPU copy)      |                    |
 |                       |--finish_write--------->|                    |
 |                       |                       |--finish_write----->|
 |                       |                       |                    |
 |                       |                       |  [StoreController detects new objects]
 |                       |                       |  [async L1->L2 push via adapters]
 |<--event_handle--------|                       |                    |

Retrieve Flow

vLLM                MPCacheServer          StorageManager         L1Manager
 |                       |                       |                    |
 |---RETRIEVE(key)------>|                       |                    |
 |                       |--read_prefetched------>|                    |
 |                       |                       |--unsafe_read------>|
 |                       |                       |<--memory_objs------|
 |                       |  (CPU->GPU copy)      |                    |
 |                       |--finish_read_prefetch->|                    |
 |                       |                       |--finish_read------>|
 |<--event_handle--------|                       |                    |

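Observability Internals

EventBus (lmcache/v1/mp_observability/event_bus.py) is a global singleton initialized by init_observability() at server startup. Producers (L1Manager, StorageManager, MPCacheServer) publish Event objects to a bounded queue (--event-bus-queue-size, default 10000, tail drop on overflow). A background drain thread dispatches each event to all registered subscribers.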
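
The bounded-queue behaviour described above can be modelled with the standard library. This is a sketch, not the real EventBus: TinyEventBus, its methods, and the "STORE_END" event name are hypothetical, and events are plain tuples rather than Event objects. The demo publishes before the drain thread starts so that the overflow is deterministic.

```python
import queue
import threading
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Optional


class TinyEventBus:
    """Hypothetical bus: bounded queue, tail drop, one drain thread."""

    def __init__(self, max_queue_size: int = 10000) -> None:
        self._queue: "queue.Queue" = queue.Queue(maxsize=max_queue_size)
        self._callbacks: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self._thread: Optional[threading.Thread] = None
        self.dropped = 0

    def subscribe(self, event_type: str, callback: Callable[[Any], None]) -> None:
        self._callbacks[event_type].append(callback)

    def publish(self, event_type: str, payload: Any) -> bool:
        try:
            self._queue.put_nowait((event_type, payload))
            return True
        except queue.Full:
            self.dropped += 1  # tail drop: the newest event is discarded
            return False

    def start(self) -> None:
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel enqueued by stop()
                return
            event_type, payload = item
            for callback in self._callbacks.get(event_type, ()):
                callback(payload)

    def stop(self) -> None:
        self._queue.put(None)  # blocks until there is room, then wakes the drainer
        if self._thread is not None:
            self._thread.join()


received: List[int] = []
bus = TinyEventBus(max_queue_size=2)
bus.subscribe("STORE_END", received.append)
results = [bus.publish("STORE_END", i) for i in range(3)]  # third one overflows
bus.start()
bus.stop()
```

Dropping on overflow keeps producers on the request path from ever blocking on observability, at the cost of losing events under sustained load.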

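Subscribers live under lmcache/v1/mp_observability/subscribers/, grouped by concern: metrics/ (OTel counters and lifecycle histograms), logging/ (Python logging handlers, lookup-hash JSONL), and tracing/ (OTel spans built from START/END event pairs). init_observability() registers the selected set based on CLI flags (--disable-metrics, --disable-logging, --enable-tracing).

**OTel providers** are set up through otel_init.py before the subscribers are constructed, so module-level get_meter() / get_tracer() calls bind to the real providers. Metrics are exported to the in-process Prometheus /metrics endpoint (--prometheus-port, default 9090) and, when --otlp-endpoint is set, also pushed to an OTel collector.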

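How to Extend

Adding a New L2 Adapter

Create a new *_l2_adapter.py module under lmcache/v1/distributed/l2_adapters/. __init__.py auto-discovers modules matching that suffix via pkgutil and imports them lazily on first use, so no other files need to change.

Create a config class that inherits from L2AdapterConfigBase and implements the from_dict() and help() methods.
Create an adapter class that implements L2AdapterInterface, along with a small factory function (config, l1_memory_desc) -> L2AdapterInterface.

At module level, self-register the config and the factory:

register_l2_adapter_type("my_adapter", MyAdapterConfig)
register_l2_adapter_factory("my_adapter", _create_my_adapter)

See mock_l2_adapter.py or s3_l2_adapter.py for reference implementations.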
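
To show how the self-registration steps fit together, here is a self-contained sketch of the registry pattern. The two register_* functions are local stand-ins that only borrow the names used above; their real signatures and behaviour may differ. create_adapter_from_dict, the "type" key, the capacity field, and MyAdapter are all hypothetical, and the config class does not inherit from the real L2AdapterConfigBase.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Local stand-in registries (assumption: name -> class / factory).
_CONFIG_TYPES: Dict[str, type] = {}
_FACTORIES: Dict[str, Callable[[Any, Any], Any]] = {}


def register_l2_adapter_type(name: str, config_cls: type) -> None:
    _CONFIG_TYPES[name] = config_cls


def register_l2_adapter_factory(name: str, factory: Callable[[Any, Any], Any]) -> None:
    _FACTORIES[name] = factory


def create_adapter_from_dict(spec: Dict[str, Any], l1_memory_desc: Any = None) -> Any:
    """Hypothetical helper: resolve config class and factory by type name."""
    name = spec["type"]
    config = _CONFIG_TYPES[name].from_dict(spec)
    return _FACTORIES[name](config, l1_memory_desc)


@dataclass
class MyAdapterConfig:
    capacity: int

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "MyAdapterConfig":
        return cls(capacity=int(d.get("capacity", 16)))

    @classmethod
    def help(cls) -> str:
        return "my_adapter: capacity=<max number of objects>"


class MyAdapter:
    def __init__(self, config: MyAdapterConfig) -> None:
        self.config = config
        self.store: Dict[str, bytes] = {}


def _create_my_adapter(config: MyAdapterConfig, l1_memory_desc: Any) -> MyAdapter:
    return MyAdapter(config)


# Module-level self-registration, as in the steps above.
register_l2_adapter_type("my_adapter", MyAdapterConfig)
register_l2_adapter_factory("my_adapter", _create_my_adapter)

adapter = create_adapter_from_dict({"type": "my_adapter", "capacity": 4})
```

Because registration happens at import time, discovering and importing the module is enough to make the adapter type available.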

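Adding an Observability Subscriber

Create a subscriber class that inherits from EventSubscriber (defined in lmcache/v1/mp_observability/event_bus.py): implement get_subscriptions() to return an {EventType: callback} mapping, and optionally override shutdown() for cleanup.
Place the class under the appropriate concern group (subscribers/metrics/, subscribers/logging/, or subscribers/tracing/) and export it from that package's __init__.py.
Register the subscriber in init_observability() (lmcache/v1/mp_observability/config.py) by calling bus.register_subscriber(...) in the branch matching its concern (metrics / logging / tracing), gated by the corresponding CLI flag if needed.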
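
The shape of a subscriber can be sketched as follows. EventType and EventSubscriber are redefined locally as stand-ins so the example runs on its own; the enum members LOOKUP_START and LOOKUP_END, the event payload, and the LookupCounter class are hypothetical.

```python
import enum
from collections import Counter
from typing import Any, Callable, Dict


class EventType(enum.Enum):
    # Hypothetical members; the real EventType is defined in event.py.
    LOOKUP_START = enum.auto()
    LOOKUP_END = enum.auto()


class EventSubscriber:
    """Local stand-in for the base class in event_bus.py."""

    def get_subscriptions(self) -> Dict[EventType, Callable[[Any], None]]:
        raise NotImplementedError

    def shutdown(self) -> None:
        pass


class LookupCounter(EventSubscriber):
    """Counts how many lookups started and finished."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.closed = False

    def get_subscriptions(self) -> Dict[EventType, Callable[[Any], None]]:
        return {
            EventType.LOOKUP_START: self._on_start,
            EventType.LOOKUP_END: self._on_end,
        }

    def _on_start(self, event: Any) -> None:
        self.counts["started"] += 1

    def _on_end(self, event: Any) -> None:
        self.counts["finished"] += 1

    def shutdown(self) -> None:
        self.closed = True


subscriber = LookupCounter()
callbacks = subscriber.get_subscriptions()
# What a bus would do for each published event: look up the callback by type.
for event_type in (EventType.LOOKUP_START, EventType.LOOKUP_START, EventType.LOOKUP_END):
    callbacks[event_type]({"request_id": "r1"})
subscriber.shutdown()
```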

Adding a New Request Type

Add a new member to RequestType in protocols/base.py.
Create a ProtocolDefinition in the appropriate protocols/*.py file (engine, controller, observability, debug, blend, blend_v2, or blend_v3) and add the request name to that module's REQUEST_NAMES.
Implement the handler method on the appropriate EngineModule (e.g. LookupModule, GPUTransferModule, BlendV3Module) and expose it as a HandlerSpec from that module's get_handlers().
run_cache_server() registers every HandlerSpec returned by the loaded modules via add_handler_helper() — no manual registration step is needed.
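
The reason no manual registration step is needed can be illustrated with a small model of modules that advertise their own handlers. The HandlerSpec fields, the module classes, and collect_handlers below are assumptions for the sketch; the real HandlerSpec, EngineModule protocol, and add_handler_helper() may look different.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class HandlerSpec:  # hypothetical fields
    request_type: str
    handler: Callable[..., Any]
    blocking: bool


class PingModule:
    def get_handlers(self) -> List[HandlerSpec]:
        return [HandlerSpec("PING", self.ping, blocking=True)]

    def ping(self) -> bool:
        return True


class ChunkSizeModule:
    def __init__(self, chunk_size: int) -> None:
        self._chunk_size = chunk_size

    def get_handlers(self) -> List[HandlerSpec]:
        return [HandlerSpec("GET_CHUNK_SIZE", self.get_chunk_size, blocking=False)]

    def get_chunk_size(self) -> int:
        return self._chunk_size


def collect_handlers(modules: Iterable[Any]) -> Dict[str, HandlerSpec]:
    """Build one dispatch table from whatever the loaded modules expose."""
    table: Dict[str, HandlerSpec] = {}
    for module in modules:
        for spec in module.get_handlers():
            if spec.request_type in table:
                raise ValueError(f"duplicate handler for {spec.request_type}")
            table[spec.request_type] = spec
    return table


handlers = collect_handlers([PingModule(), ChunkSizeModule(256)])
```

With this shape, adding a request type only requires touching the module that owns it; the server loop stays unchanged.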

Key Source Files

MPCacheServer + ZMQ server entry point
- lmcache/v1/multiprocess/config.py
  - MPServerConfig, HTTPFrontendConfig
- lmcache/v1/multiprocess/engine_context.py
  - MPCacheServerContext (shared state passed to every EngineModule)
- lmcache/v1/multiprocess/engine_module.py
  - EngineModule protocol, HandlerSpec, ThreadPoolType (per-module handler registration)
- lmcache/v1/multiprocess/modules/
  - Engine module implementations: lookup.py (LookupModule), management.py (ManagementModule), gpu_transfer.py (GPUTransferModule), non_gpu_transfer.py (NonGPUTransferModule), blend.py (BlendModule / BlendEngineV2, selected by --engine-type blend_legacy), and blend_v3.py (BlendV3Module, the paged-aware CacheBlend V3 pipeline selected by --engine-type blend).
- lmcache/v1/multiprocess/http_server.py
  - FastAPI wrapper with health checks and many other useful APIs
- lmcache/v1/multiprocess/http_api_registry.py
  - HTTPAPIRegistry auto-discovers routers in http_apis/
- lmcache/v1/multiprocess/http_apis/
  - Extensible HTTP endpoints (/, /healthcheck, /clear-cache, /status)
- lmcache/v1/multiprocess/mp_runtime_plugin_launcher.py
  - MPRuntimePluginLauncher spawns runtime plugins by serializing the full server config into environment variables
- lmcache/v1/multiprocess/protocols/base.py
  - Request types, handler types, protocol definitions
- lmcache/v1/distributed/storage_manager.py
  - StorageManager (top-level manager)
- lmcache/v1/distributed/config.py
  - StorageManagerConfig hierarchy
- lmcache/v1/distributed/l1_manager.py
  - L1Manager (object state machine)
- lmcache/v1/distributed/l2_adapters/config.py
  - L2 adapter config registry
- lmcache/v1/distributed/l2_adapters/base.py
  - L2AdapterInterface
- lmcache/v1/distributed/storage_controllers/store_controller.py
  - StoreController (event-driven L1->L2)
- lmcache/v1/distributed/storage_controllers/eviction_controller.py
  - EvictionController (watermark-triggered)
- lmcache/v1/distributed/storage_controllers/prefetch_controller.py
  - PrefetchController (L2->L1 on miss)
- lmcache/v1/mp_observability/config.py
  - Observability config + init_observability() entry point
- lmcache/v1/mp_observability/event_bus.py
  - EventBus singleton and EventSubscriber base class
- lmcache/v1/mp_observability/event.py
  - Event / EventType definitions
- lmcache/v1/mp_observability/otel_init.py
  - OTel metrics / tracing provider setup
- lmcache/v1/mp_observability/subscribers/
  - Metrics, logging, and tracing subscribers
- lmcache/v1/mp_observability/trace/
  - Trace recording (--trace-level storage) capture stack