L2 Storage (Persistent Cache) #

LMCache multiprocess mode supports a two-tier storage architecture:

L1 (fast tier) -- CPU memory by default, or an NVMe slab via GPUDirect Storage (cuFile) when --gds-l1-path is set, managed by the L1 Manager. All KV cache chunks live here during active use. (Byte-array L2 adapters are unsupported under the GDS L1 tier, which exposes no L1 memory buffer.)
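L2 (persistent) -- Durable storage backends (NIXL-based, plain filesystem, or raw block). The StoreController asynchronously pushes data from L1 to L2, and the PrefetchController loads data from L2 back into L1 on a cache miss.

Data Flow #

Write path (L1 -> L2):

vLLM stores KV cache chunks into L1 via a STORE RPC.
The StoreController detects the new objects (via eventfd) and asynchronously submits store tasks to every configured L2 adapter.
Each L2 adapter writes the data to its backend (for example, a local SSD via GDS).

Read path (L2 -> L1):

A LOOKUP RPC checks L1 for prefix hits.
For keys not found in L1, the PrefetchController submits lookup requests to the L2 adapters.
If a key is found in L2, its data is loaded back into L1 and read-locked for the pending RETRIEVE RPC.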
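
The two paths above can be pictured with a small toy model. This is only a sketch: the class and method names are hypothetical, dictionaries stand in for the real tiers, and the store fan-out is synchronous here although the StoreController performs it asynchronously.

```python
from typing import Dict, List, Optional, Set


class TwoTierCache:
    """Toy model of the L1/L2 flow (hypothetical names, not the LMCache API)."""

    def __init__(self, num_l2_adapters: int = 1) -> None:
        self.l1: Dict[str, bytes] = {}
        self.l2: List[Dict[str, bytes]] = [{} for _ in range(num_l2_adapters)]
        self.read_locked: Set[str] = set()

    def store(self, key: str, chunk: bytes) -> None:
        # Write path: the chunk lands in L1, then is pushed to every L2 adapter.
        self.l1[key] = chunk
        for adapter in self.l2:
            adapter[key] = chunk

    def lookup(self, key: str) -> bool:
        # Read path: check L1 first, then query the L2 adapters in order.
        if key in self.l1:
            return True
        for adapter in self.l2:
            if key in adapter:
                self.l1[key] = adapter[key]  # load back into L1
                self.read_locked.add(key)    # held for the pending retrieve
                return True
        return False

    def retrieve(self, key: str) -> Optional[bytes]:
        # The retrieve consumes the read lock taken during lookup.
        self.read_locked.discard(key)
        return self.l1.get(key)


cache = TwoTierCache(num_l2_adapters=2)
cache.store("chunk-0", b"kv-bytes")
del cache.l1["chunk-0"]            # simulate the chunk leaving L1
found = cache.lookup("chunk-0")    # served from L2 and reloaded into L1
```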

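Adapter Types #

`nixl_store` -- NIXL-based persistent storage #

The primary production adapter. Uses NIXL (NVIDIA Inference Xfer Library) for high-performance storage I/O.

Required fields:

backend: Storage backend -- one of POSIX, GDS, GDS_MT, HF3FS, OBJ, AZURE_BLOB.
pool_size: Number of pre-allocated storage descriptors (must be greater than 0).

Backend-specific parameters (`backend_params`):

File-based backends (GDS, GDS_MT, POSIX, HF3FS) require:

file_path: Directory path where L2 data is stored.
use_direct_io: "true" or "false" -- whether to use direct I/O.

The OBJ and AZURE_BLOB backends (object storage) do not require file_path.

Backend descriptions:

Backend	Description
`POSIX`	Standard POSIX file I/O. Works on any filesystem. No direct I/O.
`GDS`	NVIDIA GPUDirect Storage. Enables direct GPU-to-storage transfers that bypass the CPU. Requires GDS-capable NVMe SSDs.
`GDS_MT`	Multi-threaded variant of GDS for higher throughput.
`HF3FS`	Shared filesystem backend (for example, for distributed/network storage).
`OBJ`	Object storage backend. No local file path required.
`AZURE_BLOB`	Object storage backend for Azure Blob Storage. No local file path required.

Configuration examples: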

# POSIX backend
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"}, "pool_size": 64}'

# GDS backend
--l2-adapter '{"type": "nixl_store", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/lmcache", "use_direct_io": "true"}, "pool_size": 128}'

# GDS_MT backend
--l2-adapter '{"type": "nixl_store", "backend": "GDS_MT", "backend_params": {"file_path": "/data/nvme/lmcache", "use_direct_io": "true"}, "pool_size": 128}'

# HF3FS backend
--l2-adapter '{"type": "nixl_store", "backend": "HF3FS", "backend_params": {"file_path": "/mnt/hf3fs/lmcache", "use_direct_io": "false"}, "pool_size": 64}'

# OBJ backend
--l2-adapter '{"type": "nixl_store", "backend": "OBJ", "backend_params": {}, "pool_size": 32}'

# AZURE_BLOB backend
--l2-adapter '{"type": "nixl_store", "backend": "AZURE_BLOB", "backend_params": {"account_url": "https://<account_name>.blob.core.windows.net", "container_name": "<container_name>"}, "pool_size": 32}'
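
When these specs are generated by a launcher script, the field rules above can be checked before the server starts. The helper below is a sketch under that assumption: `nixl_store_spec` is a hypothetical name, not an LMCache function, and it only encodes the requirements stated in this section.

```python
import json
import shlex

FILE_BACKENDS = {"POSIX", "GDS", "GDS_MT", "HF3FS"}
OBJECT_BACKENDS = {"OBJ", "AZURE_BLOB"}


def nixl_store_spec(backend: str, pool_size: int, **backend_params: str) -> str:
    """Return a JSON string suitable for --l2-adapter (hypothetical helper)."""
    if backend not in FILE_BACKENDS | OBJECT_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    if pool_size <= 0:
        raise ValueError("pool_size must be greater than 0")
    if backend in FILE_BACKENDS:
        # File-based backends need both file_path and use_direct_io.
        for required in ("file_path", "use_direct_io"):
            if required not in backend_params:
                raise ValueError(f"{backend} requires backend_params.{required}")
    spec = {
        "type": "nixl_store",
        "backend": backend,
        "backend_params": backend_params,
        "pool_size": pool_size,
    }
    return json.dumps(spec)


spec = nixl_store_spec("POSIX", 64, file_path="/data/lmcache/l2", use_direct_io="false")
# shlex.quote keeps the JSON intact when the flag is pasted into a shell.
print("--l2-adapter", shlex.quote(spec))
```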

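`nixl_store_dynamic` -- NIXL-based dynamic storage with persist/recover #

A dynamic variant of the NIXL adapter that opens and registers files per operation instead of pre-allocating them at initialization. This enables:

Persist/recover -- Cached KV metadata survives restarts.
No file-descriptor limit -- Files are opened and closed on every transfer, so the cache can grow beyond the operating system's open-file-descriptor limit.

Note

Only file-based backends (POSIX, GDS, GDS_MT, HF3FS) are supported. The OBJ and AZURE_BLOB backends are not yet supported.

Required fields:

backend: Storage backend -- one of POSIX, GDS, GDS_MT, HF3FS.

Backend-specific parameters (`backend_params`):

file_path: Directory path where L2 data files are stored.
use_direct_io: "true" or "false".
max_capacity_gb: Maximum storage capacity in GB. When this limit is reached, the adapter rejects stores. Required for the eviction controller to compute usage.

Optional fields (for persistence):

persist_enabled (bool, default true): If true, data files are kept on disk at shutdown. If false, all data files are deleted at shutdown.

On a miss, a lookup always checks secondary storage (disk) and lazily populates the in-memory index when the file is found.

Configuration examples: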

# Basic dynamic POSIX backend (persist enabled by default)
--l2-adapter '{"type": "nixl_store_dynamic", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false", "max_capacity_gb": "10"}}'

# Explicitly disable persist
--l2-adapter '{"type": "nixl_store_dynamic", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false", "max_capacity_gb": "10"}, "persist_enabled": false}'

# With eviction
--l2-adapter '{"type": "nixl_store_dynamic", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true", "max_capacity_gb": "50"}, "eviction": {"eviction_policy": "LRU", "trigger_watermark": 0.9, "eviction_ratio": 0.1}}'

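Persistence / secondary lookup behavior:

At shutdown, the adapter keeps data files on disk by default (persist_enabled defaults to true). If it is explicitly set to false, all data files are deleted to avoid orphaned storage.
At startup, the in-memory index is empty. Every lookup miss falls through to a secondary lookup on disk: if the deterministic file exists, it is treated as a hit and the in-memory index is lazily populated from the file size.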
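
The secondary lookup can be sketched as follows. The file naming (`<key>.data`) and the `LazyIndex` class are assumptions made for illustration; the adapter's actual deterministic naming scheme is not described here.

```python
import os
import tempfile
from typing import Dict


class LazyIndex:
    """In-memory index that falls back to a disk probe on a miss (toy model)."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.sizes: Dict[str, int] = {}  # empty after every (re)start

    def _path(self, key: str) -> str:
        # Hypothetical deterministic mapping from key to file name.
        return os.path.join(self.directory, f"{key}.data")

    def lookup(self, key: str) -> bool:
        if key in self.sizes:
            return True
        path = self._path(key)
        if os.path.isfile(path):
            # Treat the existing file as a hit and record its size lazily.
            self.sizes[key] = os.path.getsize(path)
            return True
        return False


with tempfile.TemporaryDirectory() as directory:
    # A data file left behind by a previous run.
    with open(os.path.join(directory, "chunk0.data"), "wb") as f:
        f.write(b"\0" * 4096)
    index = LazyIndex(directory)  # fresh start: the index is empty
    hit = index.lookup("chunk0")
    miss = index.lookup("chunk1")
    recovered_size = index.sizes["chunk0"]
```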

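`fs` -- Filesystem-backed storage #

A pure filesystem L2 adapter that uses asynchronous I/O (aiofiles). Each KV cache object is stored as a raw .data file whose name encodes the full ObjectKey. Does **not** require NIXL -- works on any POSIX filesystem.

Required fields:

base_path: Directory where KV cache files are stored.

Optional fields:

relative_tmp_dir: Relative subdirectory for temporary files during writes (atomically renamed on completion).
read_ahead_size: Triggers filesystem read-ahead by first reading this many bytes (positive integer, optional).
use_odirect: true or false (default false) -- bypass the page cache via O_DIRECT.

Configuration examples: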

# Basic FS adapter
--l2-adapter '{"type": "fs", "base_path": "/data/lmcache/l2"}'

# With temp directory
--l2-adapter '{"type": "fs", "base_path": "/data/lmcache/l2", "relative_tmp_dir": ".tmp"}'

# With O_DIRECT for bypassing page cache
--l2-adapter '{"type": "fs", "base_path": "/data/lmcache/l2", "use_odirect": true}'
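
The "temporary file, then atomic rename" pattern behind relative_tmp_dir can be sketched with the standard library. This is an illustration rather than the adapter's code: the adapter uses aiofiles, while this sketch is synchronous, and `atomic_write` is a hypothetical name.

```python
import os
import tempfile


def atomic_write(base_path: str, name: str, payload: bytes,
                 relative_tmp_dir: str = "") -> str:
    """Write payload to a temp file under base_path, then rename it into place."""
    tmp_dir = os.path.join(base_path, relative_tmp_dir)
    os.makedirs(tmp_dir, exist_ok=True)
    final_path = os.path.join(base_path, name)
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers never observe a partially written object.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Keeping the temporary directory under base_path makes it likely that both paths share a filesystem, which the rename needs in order to stay atomic.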

`dax` -- Device-DAX storage #

An L2 adapter that maps Device-DAX paths, such as /dev/daxX.X and /dev/daxY.Y, and stores KV cache objects in fixed-size slots. This adapter is intended for byte-addressable memory devices such as persistent memory or CXL memory.

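In this release, the MP dax adapter is volatile. It keeps the key index in server memory and rebuilds an empty index on restart. Old bytes may remain on the DAX device, but they are unreachable after the LMCache server restarts.

Required fields for the legacy single-device form:

device_path: Path to a mappable DAX device or test file.
max_dax_size_gb: Number of GiB to map from device_path.
slot_bytes: Fixed slot size in bytes. It must be large enough to hold one full LMCache chunk, because MP memory descriptors do not expose the non-MP full-chunk size.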

Required fields for the multi-device form:

devices: List of objects with device_path and max_dax_size_gb. The list may be empty only when hotplug_enabled is true.
slot_bytes: Fixed slot size in bytes shared by every DAX device in the adapter facade.

Optional fields:

hotplug_enabled (bool, default false): Enables runtime /reconfigure/dax/status, /reconfigure/dax/add, /reconfigure/dax/remove, and /reconfigure/dax/resize.
num_store_workers (int, default 1): Store worker threads.
num_lookup_workers (int, default 1): Lookup worker threads.
num_load_workers (int, default min(4, os.cpu_count())): Load worker threads.
persist_enabled (bool): Accepted by the common L2 config parsing, but has no effect for dax because restart recovery is not implemented.

Configuration examples:

# Backward-compatible single-device form.
--l2-adapter '{
  "type": "dax",
  "device_path": "/dev/dax1.0",
  "max_dax_size_gb": 100,
  "slot_bytes": 268435456,
  "num_store_workers": 1,
  "num_lookup_workers": 1,
  "num_load_workers": 4,
  "eviction": {
    "eviction_policy": "LRU",
    "trigger_watermark": 0.9,
    "eviction_ratio": 0.1
  }
}'

# Multi-device hotplug-ready form.
--l2-adapter '{
  "type": "dax",
  "devices": [
    {"device_path": "/dev/daxX.X", "max_dax_size_gb": 100},
    {"device_path": "/dev/daxY.Y", "max_dax_size_gb": 100}
  ],
  "slot_bytes": 268435456,
  "hotplug_enabled": true,
  "num_store_workers": 1,
  "num_lookup_workers": 1,
  "num_load_workers": 4
}'

Runtime management uses JSON bodies because DAX paths contain slashes. See the Device-DAX backend guide for complete examples. These routes use StorageManager's generic L2 adapter reconfiguration API; the HTTP path selects the backend and operation, the DAX adapter interprets the operation payload, and the same interface can be reused by future adapters such as P2P.

curl http://127.0.0.1:9000/reconfigure/dax/status
curl -X POST http://127.0.0.1:9000/reconfigure/dax/add \
  -H 'Content-Type: application/json' \
  -d '{"device_path": "/dev/daxX.X", "size": "100GiB"}'

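Current limitations:

Runtime hotplug changes only LMCache mappings and metadata. It does not create, destroy, or reconfigure kernel CXL or DAX devices.
Per-TP partitions and on-device restart metadata are not implemented.
Only single-buffer objects are supported. Multi-tensor objects are rejected.
Capacity is slot-based, not payload-byte-based. L2 eviction and usage metrics count occupied slots.
Lookups take DAX-side external locks. submit_unlock releases them after the load/retrieve completes, making the entries evictable again.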
Remove mode="evict" is destructive for the DAX tier. Remove mode="migrate" requires enough capacity on another active DAX device.
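
Because capacity is counted in slots, the number of objects a device can hold follows from max_dax_size_gb and slot_bytes alone. The arithmetic below is a rough sketch that assumes the whole mapped range is divided into slots with no per-device overhead; the helper names are hypothetical.

```python
GIB = 1024 ** 3


def dax_slot_count(max_dax_size_gb: int, slot_bytes: int) -> int:
    # Whole slots only; assumes no metadata is carved out of the mapping.
    return (max_dax_size_gb * GIB) // slot_bytes


def slot_usage(occupied_slots: int, total_slots: int) -> float:
    # Usage is a ratio of slots, regardless of how full each slot is.
    return occupied_slots / total_slots


# Values from the single-device example: 100 GiB mapped, 256 MiB slots.
slots = dax_slot_count(100, 268435456)
print(slots)                   # 400
print(slot_usage(360, slots))  # 0.9
```

With these numbers, a trigger_watermark of 0.9 corresponds to 360 occupied slots, even if each stored object is much smaller than 256 MiB.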

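`fs_native` -- Native C++ filesystem connector #

A filesystem L2 adapter backed by the native C++ LMCacheFSClient, wrapped in NativeConnectorL2Adapter. I/O is dispatched through a C++ worker thread pool with eventfd-driven completions, which provides real I/O queue depth from a single Python thread.

Required fields:

base_path: Directory where KV cache files are stored.

Optional fields:

num_workers (int, default 4, > 0): Number of C++ worker threads inside the connector. This is the real I/O queue depth -- raise it to improve throughput on filesystems whose aggregate bandwidth exceeds per-stream bandwidth.
relative_tmp_dir (str, default ""): Relative subdirectory for temporary files during writes (atomically renamed on completion).
use_odirect (bool, default false): Bypass the page cache via O_DIRECT. Needed to measure real disk bandwidth. See the alignment notes below.
read_ahead_size (int, optional): Triggers filesystem read-ahead by issuing a warm-up read of this many bytes at open time.
max_capacity_gb (float, default 0): Maximum L2 capacity in GB for client-side usage tracking. The default of 0 disables tracking.

Important

O_DIRECT has two independent alignment requirements:

Length alignment. The transfer length must be a multiple of the filesystem block size. The connector queries the disk block size at construction time and checks len % disk_block_size on every operation. If the length is not a multiple, the connector silently falls back to a buffered open (without O_DIRECT) for that operation: correctness is preserved, but you do not get true direct I/O. To make sure O_DIRECT is actually used, choose --chunk-size so that the per-chunk byte size is a multiple of the filesystem block size. GPFS and similar parallel filesystems often use large blocks (for example, several MiB).
Memory buffer alignment. The I/O buffer pointer itself must also be aligned (typically to 4096 bytes on local disks, or to the FS block size on parallel filesystems). This is controlled by --l1-align-bytes (default 4096); raise it to match the FS block size when using a filesystem with larger blocks. If the buffer is not aligned, the underlying read/write system call returns `EINVAL` (this is not caught by the length fallback path above and surfaces as a runtime error).

If unsure, start with use_odirect: false and confirm correctness before enabling O_DIRECT.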
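
Both alignment conditions are simple modular checks, so they can be verified ahead of time. The helpers below are a sketch with hypothetical names; they mirror the len % disk_block_size test described above but are not taken from the connector.

```python
def odirect_length_ok(length: int, fs_block_size: int) -> bool:
    """True when a transfer of this length can stay on the O_DIRECT path."""
    return length % fs_block_size == 0


def buffer_aligned(address: int, align_bytes: int = 4096) -> bool:
    """True when a buffer address satisfies the pointer alignment requirement."""
    return address % align_bytes == 0


def round_up_to_block(chunk_bytes: int, fs_block_size: int) -> int:
    """Smallest multiple of fs_block_size that is >= chunk_bytes."""
    return -(-chunk_bytes // fs_block_size) * fs_block_size


print(odirect_length_ok(1048576, 4096))        # True
print(odirect_length_ok(1048576 + 512, 4096))  # False
print(round_up_to_block(5000, 4096))           # 8192
```

On a parallel filesystem with, say, 4 MiB blocks, a 1 MiB chunk would fail the length check and each operation would quietly use buffered I/O, which is why the per-chunk byte size matters.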

Configuration examples:

# Basic native FS adapter
--l2-adapter '{"type": "fs_native", "base_path": "/data/lmcache/l2"}'

# Many worker threads for a parallel filesystem (e.g. GPFS, Lustre)
--l2-adapter '{"type": "fs_native", "base_path": "/data/lmcache/l2", "num_workers": 32}'

# O_DIRECT for real-disk benchmarking
--l2-adapter '{"type": "fs_native", "base_path": "/data/lmcache/l2", "num_workers": 32, "use_odirect": true}'

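Buffer-only mode example. L1 acts as a pure write buffer that absorbs peak bursts of in-flight chunks while the C++ worker pool writes them to disk; once a store completes, nothing is retained in L1: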

lmcache server \
    --host 0.0.0.0 --port 5555 \
    --max-workers 32 \
    --l1-size-gb 32 --l1-use-lazy \
    --eviction-policy noop \
    --l2-store-policy skip_l1 \
    --l2-adapter '{"type": "fs_native", "base_path": "/data/lmcache/l2", "num_workers": 32, "use_odirect": true}'

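`raw_block` -- Raw-block-device-backed persistent storage #

A built-in L2 adapter that stores KV objects in fixed-size slots on a raw block device or a pre-sized file, using Rust raw-device I/O bindings. It reuses the existing raw-block metadata checkpoint model and writes directly into caller-provided load buffers during prefetch.

Required fields:

device_path: Raw device path or pre-sized file path.
slot_bytes: Fixed slot size in bytes. Must be aligned to block_align.

Optional fields:

capacity_bytes: Optional upper bound on usable device bytes. The default of 0 uses the whole device/file size.
use_odirect: true or false (default true).
block_align: Device alignment in bytes (default 4096).
header_bytes: Reserved header bytes per slot (default 4096).
meta_total_bytes: Reserved metadata checkpoint region (default 256 MiB).
meta_magic / meta_version: Metadata checkpoint identity/versioning.
meta_checkpoint_interval_sec / meta_idle_quiet_ms / meta_enable_periodic / meta_verify_on_load: Checkpoint and recovery controls inherited from the legacy raw-block backend.
load_checkpoint_on_init: Load an existing on-device metadata checkpoint at startup (default true). Set to false to start with an empty in-memory index.
enable_zero_copy: Attempt aligned direct-buffer I/O where possible.
io_engine: Rust raw-block I/O engine. Valid values are "posix" (default, synchronous pread/pwrite path) and "io_uring" (direct Rust io_uring syscall path).
use_uring_cmd: Enable NVMe passthrough via io_uring command interface for direct device access. Requires io_engine="io_uring" and NVMe character device node (e.g., /dev/ng0n1).
iouring_queue_depth: Queue depth for io_engine="io_uring".
max_data_transfer_size: Maximum data transfer size for use_uring_cmd=true. Large transfers are split into smaller chunks that fit within device limits.
num_store_workers / num_lookup_workers / num_load_workers: Number of worker threads per operation type.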

Notes:

raw_block is a server-owned MP adapter. It does not support per-TP device path mappings in MP mode.
raw_block remains "type": "raw_block" for all supported engines.
raw_block owns on-device slot allocation, checkpointing, and recovery through RawBlockCore. Slot reclamation is driven by the shared/global L2 eviction controller or by explicit delete() calls.
If use_odirect is enabled, the server's --l1-align-bytes should be at least block_align.
persist_enabled must remain true for this adapter to work correctly.
For use_uring_cmd=true, device_path must use the NVMe character device node (e.g., /dev/ng0n1) instead of the block device node (/dev/nvme0n1). The character device provides direct NVMe command passthrough.
use_uring_cmd requires io_engine="io_uring" to be set.
When use_uring_cmd=true, use_odirect is ignored for NVMe namespace character devices.

Configuration examples:

# Basic raw_block with posix I/O
--l2-adapter '{"type": "raw_block", "device_path": "/dev/nvme0n1", "slot_bytes": 1048576, "block_align": 4096, "header_bytes": 4096, "meta_total_bytes": 268435456, "use_odirect": true, "num_store_workers": 2, "num_lookup_workers": 1, "num_load_workers": 4}'

# With io_uring
--l2-adapter '{"type": "raw_block", "device_path": "/dev/nvme0n1", "slot_bytes": 1048576, "io_engine": "io_uring", "iouring_queue_depth": 256, "use_odirect": true}'

# With io_uring_cmd (NVMe passthrough)
--l2-adapter '{"type": "raw_block", "device_path": "/dev/ng0n1", "slot_bytes": 1048576, "io_engine": "io_uring", "use_uring_cmd": true, "iouring_queue_depth": 256, "max_data_transfer_size": 131072, "use_odirect": false}'

# With eviction
--l2-adapter '{"type": "raw_block", "device_path": "/dev/nvme0n1", "slot_bytes": 1048576, "load_checkpoint_on_init": false, "eviction": {"eviction_policy": "LRU", "trigger_watermark": 0.9, "eviction_ratio": 0.1}}'
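
Several of the raw_block rules above are cross-field constraints that are easy to get wrong in a hand-written JSON spec. The checker below is a sketch: `validate_raw_block` is a hypothetical helper, and it only encodes constraints stated in this section (it does not inspect the device itself).

```python
from typing import Any, Dict, List

VALID_IO_ENGINES = {"posix", "io_uring"}


def validate_raw_block(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a raw_block adapter spec."""
    problems: List[str] = []
    if not cfg.get("device_path"):
        problems.append("device_path is required")
    block_align = cfg.get("block_align", 4096)
    slot_bytes = cfg.get("slot_bytes", 0)
    if slot_bytes <= 0:
        problems.append("slot_bytes is required and must be positive")
    elif slot_bytes % block_align != 0:
        problems.append("slot_bytes must be a multiple of block_align")
    io_engine = cfg.get("io_engine", "posix")
    if io_engine not in VALID_IO_ENGINES:
        problems.append(f"unknown io_engine: {io_engine}")
    if cfg.get("use_uring_cmd", False) and io_engine != "io_uring":
        problems.append('use_uring_cmd requires io_engine="io_uring"')
    if cfg.get("persist_enabled", True) is not True:
        problems.append("persist_enabled must remain true")
    return problems


passthrough = {
    "type": "raw_block", "device_path": "/dev/ng0n1", "slot_bytes": 1048576,
    "io_engine": "io_uring", "use_uring_cmd": True,
}
print(validate_raw_block(passthrough))  # []
```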

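`mooncake_store` -- Mooncake Store native connector #

An L2 adapter backed by the native C++ Mooncake Store connector. Uses Mooncake for high-performance distributed KV cache storage, with RDMA support.

When Mooncake is configured with "protocol": "rdma", LMCache must also have a valid contiguous L1 memory region available. In MP mode, the distributed storage manager automatically passes this L1 memory descriptor to the adapter factory. If the descriptor is missing or invalid, adapter creation fails with a ValueError instead of silently falling back to a non-RDMA path.

Prerequisites -- Building with Mooncake support:

The Mooncake extension is **not** built by default. You must enable it explicitly: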

BUILD_MOONCAKE=1 pip install -e . --verbose

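The BUILD_MOONCAKE environment variable controls compilation:

BUILD_MOONCAKE=1: Enables the Mooncake C++ extension.
BUILD_MOONCAKE=0: Forces it off (highest priority), even if MOONCAKE_INCLUDE_DIR is set.
Unset: Falls back to checking MOONCAKE_INCLUDE_DIR for backward compatibility. If MOONCAKE_INCLUDE_DIR is also unset, the extension is skipped.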
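
The precedence between the two variables can be written as a small decision function. This is a sketch of the documented rules, not the project's build script; in particular, treating any value other than "1" or "0" like an unset variable is an assumption.

```python
from typing import Mapping


def should_build_mooncake(env: Mapping[str, str]) -> bool:
    """Decide whether the Mooncake extension is built, per the rules above."""
    flag = env.get("BUILD_MOONCAKE")
    if flag == "1":
        return True
    if flag == "0":
        # Highest priority: disabled even when MOONCAKE_INCLUDE_DIR is set.
        return False
    # Unset (or, by assumption, any other value): backward-compatible fallback.
    return bool(env.get("MOONCAKE_INCLUDE_DIR"))


print(should_build_mooncake({"BUILD_MOONCAKE": "0", "MOONCAKE_INCLUDE_DIR": "/opt/mc"}))  # False
print(should_build_mooncake({"MOONCAKE_INCLUDE_DIR": "/opt/mc"}))                         # True
print(should_build_mooncake({}))                                                          # False
```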

If the Mooncake headers are not installed in a system include path (for example, /usr/local/include), you must point to them explicitly:

BUILD_MOONCAKE=1 \
MOONCAKE_INCLUDE_DIR=/path/to/mooncake/include \
MOONCAKE_LIB_DIR=/path/to/mooncake/lib \
pip install -e . --verbose

LMCache-specific fields:

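num_workers: Number of C++ worker threads in the shared pool (default 4, must be greater than 0).
per_op_workers (dict[str, int], optional): A dictionary mapping lane keys to dedicated worker thread counts. Supported keys:
- "lookup" -- threads for EXISTS operations.
- "retrieve" -- threads for GET / load operations.
- "store" -- threads for SET / put operations.
- "delete" -- threads for DELETE operations.
Operations whose lane key is absent from the dictionary use the shared num_workers pool. You do not need to set every key -- configure only the lanes that need a dedicated pool.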
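
The lane resolution rule can be summarized in a few lines. The function below is an illustrative sketch with a hypothetical name; it only restates which pool each operation type would use and does not start any threads.

```python
from typing import Dict, Optional, Tuple

LANES = ("lookup", "retrieve", "store", "delete")


def resolve_lane_pools(
    num_workers: int = 4,
    per_op_workers: Optional[Dict[str, int]] = None,
) -> Dict[str, Tuple[str, int]]:
    """Map each lane to ("dedicated", n) or ("shared", num_workers)."""
    per_op_workers = per_op_workers or {}
    unknown = set(per_op_workers) - set(LANES)
    if unknown:
        raise ValueError(f"unsupported lane keys: {sorted(unknown)}")
    if num_workers <= 0:
        raise ValueError("num_workers must be greater than 0")
    pools: Dict[str, Tuple[str, int]] = {}
    for lane in LANES:
        if lane in per_op_workers:
            pools[lane] = ("dedicated", per_op_workers[lane])
        else:
            pools[lane] = ("shared", num_workers)
    return pools


# A GET-heavy layout: only "delete" falls back to the shared pool.
pools = resolve_lane_pools(per_op_workers={"lookup": 2, "retrieve": 16, "store": 4})
print(pools["retrieve"], pools["delete"])
```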

Mooncake fields:

All other keys in the JSON config (except type, num_workers, per_op_workers, and eviction) are forwarded verbatim to Mooncake's setup_internal(ConfigDict). See the Mooncake documentation for the available setup keys (for example, local_hostname, metadata_server, master_server_addr, protocol, rdma_devices, global_segment_size).

Configuration examples:

# Shared pool (default)
--l2-adapter '{
  "type": "mooncake_store",
  "num_workers": 4,
  "local_hostname": "node01",
  "metadata_server": "http://localhost:8080/metadata",
  "master_server_addr": "localhost:50051",
  "protocol": "tcp",
  "local_buffer_size": "3221225472",
  "global_segment_size": "3221225472"
}'

# Per-operation pools (GET-heavy workload)
--l2-adapter '{
  "type": "mooncake_store",
  "per_op_workers": {
    "lookup": 2,
    "retrieve": 16,
    "store": 4
  },
  "local_hostname": "node01",
  "metadata_server": "http://localhost:8080/metadata",
  "master_server_addr": "localhost:50051",
  "protocol": "tcp"
}'

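For full Mooncake setup instructions (master service, metadata server, and so on), see Mooncake.

RDMA notes:

protocol: "rdma" requires a valid LMCache L1 memory descriptor.
When using protocol: "rdma", it is recommended to disable lazy L1 allocation via --no-l1-use-lazy so that the L1 buffer is fully allocated before Mooncake registers it.
protocol: "tcp" does not require L1 pre-registration.
If Mooncake RDMA initialization fails at adapter creation, verify that LMCache L1 memory is enabled and that the descriptor has a non-zero pointer and size.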

`aerospike` -- Aerospike native connector #

An L2 adapter backed by the native C++ Aerospike connector (the same ConnectorBase worker-pool harness used by fs_native), wrapped with NativeConnectorL2Adapter. KV objects are stored under a meta record plus optional payload segments so values larger than the server record cap are transparently sharded.

Prerequisites -- Building with Aerospike support:

The Aerospike extension is not built by default. Install the Aerospike C client, then build with BUILD_AEROSPIKE=1 (or set AEROSPIKE_INCLUDE_DIR):

BUILD_AEROSPIKE=1 pip install -e .

See Adding a Native Connector (section "Built-in Aerospike backend") for installing the C client into .deps/ and the aerospike-client-c.env example.

Required fields:

hosts: Seed hosts as host:port[,host:port...].

Optional fields:

namespace (str, default "lmcache"): Aerospike namespace. Must exist on the server and have nsup-period > 0 if you rely on TTL expiry.
set_name / set (str, default "kv_chunks"): Aerospike set name.
num_workers (int, default 8, > 0): C++ worker threads for I/O. This is the real I/O queue depth -- raise it to push throughput.
read_timeout_ms (int, default 1000): Client read timeout.
write_timeout_ms (int, default 2000): Client write timeout.
default_ttl_seconds (int, default 86400): Record TTL. 0 uses the namespace default TTL.
target_segment_bytes (int, default 0): Target shard size. 0 uses the discovered server record cap.
max_record_bytes (int, default 0): Override the server record cap. 0 discovers it at construction time.
username / password (str, default ""): Optional Enterprise Edition authentication.
max_capacity_gb (float, default 0): Maximum L2 capacity in GB for client-side usage tracking / eviction. 0 disables tracking.

Environment variable fallbacks. When the corresponding config value is empty, these environment variables are used: LMCACHE_AEROSPIKE_HOSTS, LMCACHE_AEROSPIKE_NAMESPACE, LMCACHE_AEROSPIKE_SET, LMCACHE_AEROSPIKE_USERNAME, LMCACHE_AEROSPIKE_PASSWORD.
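
The fallback rule and the hosts format can be sketched as below. The function names are hypothetical, and the sketch assumes that "empty" means a missing or empty-string value; how the real connector orders built-in defaults against environment variables is not specified here.

```python
import os
from typing import Dict, List, Mapping, Tuple

ENV_FALLBACKS = {
    "hosts": "LMCACHE_AEROSPIKE_HOSTS",
    "namespace": "LMCACHE_AEROSPIKE_NAMESPACE",
    "set_name": "LMCACHE_AEROSPIKE_SET",
    "username": "LMCACHE_AEROSPIKE_USERNAME",
    "password": "LMCACHE_AEROSPIKE_PASSWORD",
}


def resolve_aerospike(config: Mapping[str, str],
                      environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Fill empty config values from the documented environment variables."""
    resolved = dict(config)
    for field, variable in ENV_FALLBACKS.items():
        if not resolved.get(field) and environ.get(variable):
            resolved[field] = environ[variable]
    return resolved


def parse_hosts(hosts: str) -> List[Tuple[str, int]]:
    """Split "host:port[,host:port...]" into (host, port) pairs."""
    seeds = []
    for item in hosts.split(","):
        host, _, port = item.strip().rpartition(":")
        seeds.append((host, int(port)))
    return seeds


fake_env = {"LMCACHE_AEROSPIKE_HOSTS": "10.0.0.1:3000,10.0.0.2:3000",
            "LMCACHE_AEROSPIKE_NAMESPACE": "other"}
cfg = resolve_aerospike({"hosts": "", "namespace": "lmcache"}, fake_env)
print(parse_hosts(cfg["hosts"]), cfg["namespace"])
```

An explicit config value wins over the environment, which is why namespace stays "lmcache" in this run.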

Configuration examples:

# Basic single-node Community Edition
--l2-adapter '{"type": "aerospike", "hosts": "127.0.0.1:3000", "namespace": "lmcache", "set_name": "kv_chunks", "num_workers": 8}'

# Multi-node seed list with capacity tracking for eviction
--l2-adapter '{"type": "aerospike", "hosts": "10.0.0.1:3000,10.0.0.2:3000", "namespace": "lmcache", "num_workers": 16, "max_capacity_gb": 512}'

# Enterprise Edition with authentication
--l2-adapter '{"type": "aerospike", "hosts": "as.internal:3000", "namespace": "lmcache", "username": "lmcache", "password": "secret"}'

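`s3` -- S3-compatible object storage #

An L2 adapter that stores KV cache objects as S3 objects using the AWS Common Runtime (CRT). Supports AWS S3, S3 Express One Zone, and any S3-compatible endpoint (such as MinIO or Ceph RGW).

Required fields:

s3_endpoint: Bucket URL -- either "s3://<bucket>" or a bare-host form (for non-AWS endpoints).
s3_region: AWS region string (for example, "us-west-2").

Optional fields:

s3_num_io_threads (int, default 64): Number of CRT I/O threads.
s3_prefer_http2 (bool, default true): Negotiate HTTP/2 via ALPN.
s3_enable_s3express (bool, default false): Enable S3 Express signing for S3 Express One Zone buckets.
disable_tls (bool, default false): Bypass TLS when pointing at a plain-HTTP endpoint (for example, a local MinIO).
aws_access_key_id / aws_secret_access_key (string): Static credentials; omit both to use the AWS default credential provider chain (environment, EC2 instance profile, and so on).
max_capacity_gb (float, default 0.0): Aggregate capacity used by get_usage(). A value of 0 disables aggregate eviction (usage_fraction == -1.0).

Configuration examples: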

# AWS S3 with default credentials
--l2-adapter '{"type": "s3", "s3_endpoint": "s3://my-bucket", "s3_region": "us-west-2"}'

# Static credentials, HTTP/2 disabled
--l2-adapter '{"type": "s3", "s3_endpoint": "s3://my-bucket", "s3_region": "us-west-2", "s3_prefer_http2": false, "aws_access_key_id": "AKIA...", "aws_secret_access_key": "..."}'

# Local MinIO over plain HTTP
--l2-adapter '{"type": "s3", "s3_endpoint": "minio.local:9000", "s3_region": "us-east-1", "disable_tls": true, "aws_access_key_id": "minio", "aws_secret_access_key": "minio123"}'

`hfbucket` -- Hugging Face Buckets #

An L2 adapter that stores KV cache objects in a Hugging Face Bucket using the huggingface_hub bucket APIs. Blocking Hub calls run on a bounded thread pool driven by an asyncio loop on a daemon thread, so the L2 controller thread is never blocked on network I/O.

Object names are derived from the MP ObjectKey as <model>@<kv_rank_hex>@<chunk_hash_hex>[@<cache_salt>] and then encoded with the standard HFBucket object-name encoding plus the optional bucket prefix. Because Hugging Face batch writes are not transactional, a store task that partially fails reconciles backend metadata so that any objects that actually landed are still counted for usage accounting and later deletion.

This is a persistent remote backend best suited to warm and cold KV cache tiers; prefer a lower-latency local adapter for the hottest cache tier.

Required fields:

bucket_handle: Bucket location in the form hf://buckets/<namespace>/<bucket>[/<prefix>].

Optional fields:

token_env (string, default "HF_TOKEN"): Environment variable used to resolve the Hugging Face access token.
token (string): Direct token fallback used when token_env is unset.
create_bucket_if_missing (bool, default false): Create the bucket lazily on the first store instead of requiring it to exist.
download_tmp_dir (string): Root directory for temporary load downloads.
metadata_cache_ttl_secs (float, default 30.0): TTL for the path-size metadata cache that backs lookups and usage accounting.
num_workers (int, default 4): Number of worker threads for blocking Hugging Face Hub API calls.
max_capacity_gb (float, default 0.0): Aggregate capacity used by get_usage(). A value of 0 disables aggregate eviction.
eviction (dict): Optional eviction policy, see L2AdapterConfigBase.

Configuration examples:

# Minimal: use an existing bucket with a token from $HF_TOKEN
--l2-adapter '{"type": "hfbucket", "bucket_handle": "hf://buckets/my-org/lmcache-kv/prod"}'

# Create the bucket on first store and bound the worker pool
--l2-adapter '{"type": "hfbucket", "bucket_handle": "hf://buckets/my-org/lmcache-kv/prod", "create_bucket_if_missing": true, "num_workers": 8}'

# Enable aggregate eviction with a capacity cap
--l2-adapter '{"type": "hfbucket", "bucket_handle": "hf://buckets/my-org/lmcache-kv/prod", "max_capacity_gb": 50, "eviction": {"eviction_policy": "LRU", "trigger_watermark": 0.9, "eviction_ratio": 0.1}}'

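`mock` -- Mock adapter for testing #

Simulates L2 storage with configurable size and bandwidth. Useful for testing the L2 pipeline without real storage hardware.

Fields:

max_size_gb: Maximum size in GB (> 0).
mock_bandwidth_gb: Simulated bandwidth in GB/s (> 0).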

--l2-adapter '{"type": "mock", "max_size_gb": 256, "mock_bandwidth_gb": 10}'

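Multiple Adapters (Cascading) #

You can configure multiple L2 adapters by repeating the --l2-adapter flag. Adapters are used in the order given. The StoreController pushes data to all configured adapters, and the PrefetchController queries the adapters in order during lookups.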

# SSD (fast, smaller) + NVMe GDS (larger capacity)
--l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/ssd/l2", "use_direct_io": "false"}, "pool_size": 64}' \
--l2-adapter '{"type": "nixl_store", "backend": "GDS", "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true"}, "pool_size": 128}'

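Store and Prefetch Policies #

A store policy controls how keys flow from L1 to L2: which adapters receive each key, and whether keys are deleted from L1 after a successful L2 store. A prefetch policy controls how keys flow from L2 back to L1: when multiple adapters hold the same key, the policy decides which adapter loads it.

Select policies via the CLI: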

--l2-store-policy default \
--l2-prefetch-policy default

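Built-in policies:

Flag	Name	Behavior
`--l2-store-policy`	`default`	Stores all keys to all adapters. Never deletes from L1.
`--l2-store-policy`	`skip_l1`	Buffer-only mode. Stores all keys to all adapters, then immediately deletes them from L1. Pair with `--eviction-policy noop` to avoid useless LRU overhead.
`--l2-prefetch-policy`	`default`	For each key, picks the first (lowest-index) adapter that has it. Prefetched keys are temporary (deleted after the reader finishes).
`--l2-prefetch-policy`	`retain`	Same load plan as `default`, but prefetched keys are retained in L1 permanently. Useful when prefetched data is likely to be reused by later requests (for example, shared system-prompt chunks).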
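
The behaviors in the table can be illustrated with two small functions. Both are toy sketches with hypothetical names: adapters are modeled as sets of keys, and the real policies operate on LMCache's own key and adapter objects.

```python
from typing import Dict, List, Set


def default_load_plan(keys: List[str], adapters: List[Set[str]]) -> Dict[str, int]:
    """Map each key to the lowest-index adapter holding it; skip keys found nowhere."""
    plan: Dict[str, int] = {}
    for key in keys:
        for index, held in enumerate(adapters):
            if key in held:
                plan[key] = index
                break
    return plan


def after_l2_store(l1: Dict[str, bytes], key: str, store_policy: str) -> None:
    """Apply the L1 side effect of a store policy once the L2 store succeeded."""
    if store_policy == "skip_l1":
        l1.pop(key, None)  # buffer-only mode: drop the key from L1 right away
    # "default" leaves the key in L1.


adapters = [{"a"}, {"a", "b"}]
plan = default_load_plan(["a", "b", "c"], adapters)
print(plan)  # {'a': 0, 'b': 1}

l1 = {"a": b"chunk"}
after_l2_store(l1, "a", "skip_l1")
print(l1)    # {}
```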

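Prefetch Concurrency #

The --l2-prefetch-max-in-flight flag caps the number of prefetch requests the PrefetchController can have in flight at any time. Higher values increase L2-to-L1 throughput, but also increase L1 memory pressure from in-flight data.

Flag	Default	Description
`--l2-prefetch-max-in-flight`	`8`	Maximum number of concurrent prefetch requests.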
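
A cap on in-flight requests is commonly implemented with a counting semaphore. The asyncio sketch below shows that general technique only; it is not the PrefetchController's implementation, and `prefetch_all` and `fake_load` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def prefetch_all(keys: List[int],
                       load: Callable[[int], Awaitable[int]],
                       max_in_flight: int = 8) -> List[int]:
    """Run load(key) for every key with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(key: int) -> int:
        async with semaphore:
            return await load(key)

    return await asyncio.gather(*(bounded(key) for key in keys))


async def _demo():
    state = {"current": 0, "peak": 0}

    async def fake_load(key: int) -> int:
        # Track how many loads overlap to confirm the cap holds.
        state["current"] += 1
        state["peak"] = max(state["peak"], state["current"])
        await asyncio.sleep(0.01)
        state["current"] -= 1
        return key

    results = await prefetch_all(list(range(32)), fake_load, max_in_flight=8)
    return results, state["peak"]


results, peak = asyncio.run(_demo())
print(len(results), peak)
```

Each in-flight load holds L1 memory for its chunk, so the cap bounds the extra L1 footprint at roughly max_in_flight times the chunk size.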

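Buffer-Only Mode #

When L1 serves only as a write buffer (all data lives in L2), use --l2-store-policy skip_l1 together with --eviction-policy noop. This combination deletes keys from L1 as soon as they are stored to L2 and disables the LRU eviction tracker entirely, which reduces memory and CPU overhead.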

--eviction-policy noop \
--l2-store-policy skip_l1 \
--l2-prefetch-policy default

Policies are extensible -- new policies can be added by creating a file in storage_controllers/ and calling register_store_policy() or register_prefetch_policy() at import time. See the design document l2_adapters/design_docs/overall.md for details.

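Serialization (Compression / Quantization) #

Each adapter can optionally run a **serde** (serializer/deserializer) that transforms data on its way into and out of L2 -- for example, fp8 quantization for disk backends, or encryption for remote adapters. See L2 Serialization (Serialization / Deserialization) for details and configuration.

Eviction #

LMCache supports eviction at both storage tiers so that each tier can operate within a fixed capacity budget.

L1 Eviction #

L1 eviction runs a background thread that monitors overall L1 memory usage. When usage exceeds trigger_watermark, the eviction policy evicts a fraction of the least recently used keys.

CLI flags:

Flag	Default	Description
`--eviction-policy`	(required)	Policy name: `LRU` or `noop`.
`--eviction-trigger-watermark`	`0.8`	Fraction of L1 usage, in [0, 1], that triggers eviction.
`--eviction-ratio`	`0.2`	Fraction of currently allocated L1 memory evicted per cycle.

Example: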

--eviction-policy LRU \
--eviction-trigger-watermark 0.8 \
--eviction-ratio 0.2

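L2 Eviction #

L2 eviction is per-adapter and opt-in. Each adapter can independently declare an eviction policy by adding an "eviction" sub-object to its --l2-adapter JSON spec. Adapters without an "eviction" key have no eviction controller.

When L2 eviction is enabled for an adapter, a dedicated background thread monitors that adapter's get_usage() value. Once usage exceeds trigger_watermark, the policy evicts keys amounting to an eviction_ratio fraction of the used capacity in each cycle.

`"eviction"` sub-object fields:

Field	Default	Description
`eviction_policy`	(required)	Policy name: `"LRU"` or `"noop"`.
`trigger_watermark`	`0.8`	Fraction of adapter usage, in [0, 1], that triggers eviction.
`eviction_ratio`	`0.2`	Fraction of used capacity evicted per cycle.

Example -- nixl_store with LRU eviction: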

--l2-adapter '{
  "type": "nixl_store",
  "backend": "POSIX",
  "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"},
  "pool_size": 128,
  "eviction": {
    "eviction_policy": "LRU",
    "trigger_watermark": 0.8,
    "eviction_ratio": 0.2
  }
}'
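
The interaction between trigger_watermark, eviction_ratio, and pinned keys can be sketched as one eviction cycle over an LRU-ordered map. This is a toy model with hypothetical names; it reads eviction_ratio as "fraction of used capacity freed per cycle", following the field table above.

```python
from collections import OrderedDict
from typing import FrozenSet, List


class WatermarkLRU:
    """Toy LRU store with a watermark-triggered eviction cycle."""

    def __init__(self, capacity: int, trigger_watermark: float = 0.8,
                 eviction_ratio: float = 0.2) -> None:
        self.capacity = capacity
        self.trigger_watermark = trigger_watermark
        self.eviction_ratio = eviction_ratio
        self.entries: "OrderedDict[str, int]" = OrderedDict()  # key -> size, oldest first
        self.used = 0

    def put(self, key: str, size: int) -> None:
        if key in self.entries:
            self.used -= self.entries.pop(key)
        self.entries[key] = size
        self.used += size

    def touch(self, key: str) -> None:
        self.entries.move_to_end(key)  # mark as most recently used

    def usage(self) -> float:
        return self.used / self.capacity

    def run_cycle(self, pinned: FrozenSet[str] = frozenset()) -> List[str]:
        """Evict LRU keys if usage exceeds the watermark; skip pinned keys."""
        if self.usage() <= self.trigger_watermark:
            return []
        target = self.used * self.eviction_ratio
        freed = 0
        evicted: List[str] = []
        for key in list(self.entries):
            if freed >= target:
                break
            if key in pinned:
                continue  # in-flight load: retry in a later cycle
            freed += self.entries.pop(key)
            evicted.append(key)
        self.used -= freed
        return evicted


store = WatermarkLRU(capacity=10)
for i in range(10):
    store.put(f"k{i}", 1)
evicted = store.run_cycle(pinned=frozenset({"k0"}))
print(evicted, store.usage())  # ['k1', 'k2'] 0.8
```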

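Adapter support:

Adapter	L2 eviction support
`nixl_store`	Fully supported. `delete` frees pool slots; pinned keys (in-flight loads) are skipped and retried in the next cycle.
`nixl_store_dynamic`	Fully supported. `delete` removes data files from disk; pinned keys are skipped. `get_usage` is byte-based (`_total_bytes / max_capacity_bytes`).
`mock`	Fully supported. Useful for testing eviction behavior without real storage hardware.
`raw_block`	Fully supported with shared/global eviction. `delete` reclaims raw-block slots; locked entries are skipped and retried in the next cycle.
`s3`	`delete` removes objects from the bucket and frees aggregate byte accounting. `get_usage` reports `usage_fraction == -1.0` when `max_capacity_gb` is `0` (disabled); set a non-zero `max_capacity_gb` to enable the watermark-triggered eviction controller.
`hfbucket`	`delete` removes objects from the bucket and frees aggregate byte accounting. `get_usage` reports `usage_fraction == -1.0` when `max_capacity_gb` is `0` (disabled); set a non-zero `max_capacity_gb` to enable the watermark-triggered eviction controller. Locked keys (in-flight loads) are skipped.
`dax`	Fully supported. `delete` immediately removes unlocked keys from the in-memory index and reclaims pinned slots once active read borrows drain. Usage is slot-based.
`mooncake_store`	Eviction not supported (native connector adapter).
`fs`	Eviction not supported (`delete` and `get_usage` are no-ops).
Native connectors	Eviction not supported.

Note

Each L2 adapter instance has its own independent eviction controller and policy. Two adapters of the same type can have different watermarks or policies.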

L1 + L2 Eviction Example #

--l1-size-gb 100 \
--eviction-policy LRU \
--eviction-trigger-watermark 0.8 \
--eviction-ratio 0.2 \
--l2-adapter '{
  "type": "nixl_store",
  "backend": "GDS",
  "backend_params": {"file_path": "/data/nvme/l2", "use_direct_io": "true"},
  "pool_size": 256,
  "eviction": {
    "eviction_policy": "LRU",
    "trigger_watermark": 0.9,
    "eviction_ratio": 0.1
  }
}'

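In this setup:

L1 evicts when its memory usage reaches 80%, reclaiming 20% of allocated memory per cycle.
L2 (NIXL/GDS) evicts when 90% of the pool's slots are occupied, reclaiming 10% per cycle.
The two tiers use independent LRU policies, so each tier evicts its own least recently used keys.

Verifying L2 Storage #

Set LMCACHE_LOG_LEVEL=DEBUG to see L2 activity in the server logs: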

LMCACHE_LOG_LEVEL=DEBUG lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --l2-adapter '{"type": "nixl_store", "backend": "POSIX", "backend_params": {"file_path": "/data/lmcache/l2", "use_direct_io": "false"}, "pool_size": 64}'

Expected log messages when L2 is active:

LMCache DEBUG: Submitted store task ...
LMCache DEBUG: L2 store task N completed ...
LMCache DEBUG: Prefetch request submitted: X total keys, Y L1 prefix hits, Z remaining for L2
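
To track how much traffic actually reaches L2, the prefetch line can be parsed from the server log. The regular expression below is a sketch that assumes X, Y, and Z appear as plain integers in exactly the wording shown above; the sample line is made up for illustration and the real format may differ between versions.

```python
import re
from typing import Optional, Tuple

PREFETCH_PATTERN = re.compile(
    r"Prefetch request submitted: (\d+) total keys, "
    r"(\d+) L1 prefix hits, (\d+) remaining for L2"
)


def parse_prefetch_line(line: str) -> Optional[Tuple[int, int, int]]:
    """Return (total_keys, l1_hits, remaining_for_l2), or None if no match."""
    match = PREFETCH_PATTERN.search(line)
    if match is None:
        return None
    total, l1_hits, remaining = (int(group) for group in match.groups())
    return total, l1_hits, remaining


# Hypothetical sample line with numbers substituted for X, Y, and Z.
sample = ("LMCache DEBUG: Prefetch request submitted: "
          "10 total keys, 6 L1 prefix hits, 4 remaining for L2")
print(parse_prefetch_line(sample))  # (10, 6, 4)
```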