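lmcache bench

The lmcache bench command runs sustained performance benchmarks. It has three subcommands, each targeting a different layer of the stack:

Subcommand	Description
`engine`	Benchmarks an inference engine (for example, vLLM) with workloads that exercise a range of KV cache reuse patterns.
`server`	End-to-end sanity test against a running LMCache MP cache server (ZMQ + HTTP). Requires a full `lmcache` install and, in the default GPU mode, a GPU.
`l2`	Throughput and latency benchmark for L2 cache adapters (store / lookup / load).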

lmcache bench {engine,server,l2} [options]

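engine

The lmcache bench engine command runs sustained performance benchmarks against an inference engine (for example, vLLM). It supports several workload types that exercise different caching patterns and reports metrics such as TTFT, decoding speed, and throughput.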

lmcache bench engine [options]

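There are three ways to configure a benchmark:

CLI arguments -- pass every option on the command line.
Interactive mode -- run lmcache bench engine without the required arguments and follow the step-by-step prompts.
Config file -- save the configuration as JSON and replay it with --config.

Quick start

Minimal example (all required arguments):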

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080

Interactive mode (guided setup):

lmcache bench engine

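Interactive mode walks you through each required setting, then asks whether you want to configure the general and workload-specific options or accept the defaults.

From a saved config file: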

lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

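The config file holds the benchmark parameters (workload, KV cache settings, and so on) but not the engine URL, so the same config can be reused across engines.

Export a config without running the benchmark: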

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --export-config my_bench.json

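This resolves all auto-detected values (model name, tokens per GB) and saves them to a portable JSON file that can be used without an LMCache server.

Non-interactive mode (for scripts and CI):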

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --no-interactive

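If any required argument is missing, the command fails immediately instead of entering interactive mode, which makes it suitable for automated pipelines.

If you do not have an LMCache server, pass --tokens-per-gb-kvcache directly instead of --lmcache-url (see Finding --tokens-per-gb-kvcache for how to obtain this value).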
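For scripted use, the non-interactive invocation above can be assembled programmatically. The following is a minimal sketch, not part of LMCache: the helper name `build_engine_bench_argv` is hypothetical, and it only combines flags that this page documents.

```python
import shlex


def build_engine_bench_argv(engine_url, workload, lmcache_url=None, tokens_per_gb=None):
    """Assemble a non-interactive `lmcache bench engine` command line.

    Hypothetical helper: it uses only flags documented on this page.
    """
    if lmcache_url is None and tokens_per_gb is None:
        # One of the two is needed so tokens-per-GB can be resolved.
        raise ValueError("provide lmcache_url or tokens_per_gb")
    argv = [
        "lmcache", "bench", "engine",
        "--engine-url", engine_url,
        "--workload", workload,
    ]
    if lmcache_url is not None:
        argv += ["--lmcache-url", lmcache_url]
    else:
        argv += ["--tokens-per-gb-kvcache", str(tokens_per_gb)]
    # Fail fast instead of prompting when something is missing.
    argv.append("--no-interactive")
    return argv


argv = build_engine_bench_argv(
    "http://localhost:8000", "long-doc-qa", tokens_per_gb=46020
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv)`; its return code can then be checked against the exit codes documented for this subcommand.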

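General options

Flag	Required	Description
`--config FILE`	No	Load the configuration from a JSON file and skip interactive mode. CLI arguments override values from the file. The engine URL is not stored in the config file and must be supplied separately.
`--export-config FILE`	No	Export the resolved configuration to a JSON file and exit without running the benchmark. Auto-detected values (model name, tokens per GB) are resolved and saved, which makes the config portable. Environment-specific values (engine URL, LMCache URL) are not exported.
`--no-interactive`	No	Disable interactive mode. Fail immediately if a required argument is missing instead of prompting. Intended for scripts and CI.
`--engine-url URL`	Yes	Inference engine URL (for example, `http://localhost:8000`). Set the `OPENAI_API_KEY` environment variable if authentication is required.
`--workload TYPE`	Yes	Workload type: `long-doc-qa`, `multi-round-chat`, `long-doc-permutator`, `prefix-suffix-tuner`, or `random-prefill`.
`--tokens-per-gb-kvcache N`	*	Number of tokens per GB of KV cache. Required unless `--lmcache-url` is set. See Finding --tokens-per-gb-kvcache for how to obtain this value.
`--lmcache-url URL`	No	LMCache HTTP server URL. When provided, `--tokens-per-gb-kvcache` is auto-detected from the server.
`--model NAME`	No	Model name. Auto-detected from the engine when omitted.
`--kv-cache-volume GB`	No	Target active KV cache volume in GB (default: 100).
`--seed N`	No	Random seed (default: 42).
`--output-dir DIR`	No	Directory for the CSV and JSON output files (default: current directory).
`--no-csv`	No	Skip CSV export.
`--json`	No	Export a JSON summary file.
`-q` / `--quiet`	No	Suppress the live progress display.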

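Finding `--tokens-per-gb-kvcache`

If you already have a running LMCache server, the simplest approach is to pass --lmcache-url and let the tool detect the value automatically.

If you are running vLLM without LMCache, look for the following lines in the vLLM startup log: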

INFO: Available KV cache memory: 12.34 GiB
INFO: GPU KV cache size: 567,890 tokens

Then compute:

tokens_per_gb = 567890 / 12.34 = 46,020
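The same arithmetic can be scripted. The sketch below assumes the two log lines appear in exactly the form shown above (real vLLM logs may carry extra prefixes, which the regular expressions tolerate); the function name `tokens_per_gb_from_log` is hypothetical.

```python
import re


def tokens_per_gb_from_log(log_text):
    """Derive tokens-per-GB from the two vLLM startup log lines.

    Assumes the line formats shown in this document.
    """
    mem = re.search(r"Available KV cache memory:\s*([\d.]+)\s*GiB", log_text)
    size = re.search(r"GPU KV cache size:\s*([\d,]+)\s*tokens", log_text)
    if mem is None or size is None:
        raise ValueError("expected KV cache lines not found in log")
    gib = float(mem.group(1))
    tokens = int(size.group(1).replace(",", ""))  # drop thousands separators
    return int(tokens / gib)  # truncate to a whole number of tokens


sample_log = """\
INFO: Available KV cache memory: 12.34 GiB
INFO: GPU KV cache size: 567,890 tokens
"""
print(tokens_per_gb_from_log(sample_log))  # 46020
```

The printed value can be passed directly as `--tokens-per-gb-kvcache`.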

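Workloads

long-doc-qa

Simulates repeated question answering over long documents. A warmup phase sends each document once to populate the KV cache, after which the benchmark queries are dispatched with semaphore-bounded concurrency.

Flag	Default	Description
`--ldqa-document-length`	10000	Token length of each synthetic document.
`--ldqa-query-per-document`	2	Number of questions asked per document.
`--ldqa-shuffle-policy`	`random`	Request ordering: `random` (shuffled) or `tile` (round-robin, one round at a time).
`--ldqa-num-inflight-requests`	3	Maximum number of concurrent in-flight requests.

Example: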

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --kv-cache-volume 50 \
    --ldqa-document-length 8000 \
    --ldqa-query-per-document 4 \
    --ldqa-shuffle-policy tile

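multi-round-chat

Simulates stateful multi-turn chat. It creates concurrent user sessions, schedules requests at a fixed QPS, and records each response in the session history so that every subsequent query carries the prior context.

Flag	Default	Description
`--mrc-shared-prompt-length`	2000	Token length of the system prompt for each session.
`--mrc-chat-history-length`	10000	Token length of the pre-filled chat history.
`--mrc-user-input-length`	50	Tokens per user query.
`--mrc-output-length`	200	Maximum number of tokens generated per response.
`--mrc-qps`	1.0	Target queries per second.
`--mrc-duration`	60.0	Benchmark duration in seconds.

Example: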

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload multi-round-chat \
    --lmcache-url http://localhost:8080 \
    --mrc-qps 2.0 \
    --mrc-duration 120

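long-doc-permutator

Stress-tests blended KV cache reuse by sending different permutations of a set of context documents. Each request concatenates all context documents in a different order: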

[System Prompt] + [Doc_i1] + [Doc_i2] + ... + [Doc_iN]

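A dummy warmup request is sent before the benchmark phase begins. Requests are dispatched with semaphore-bounded concurrency.

Flag	Default	Description
`--ldp-num-contexts`	5	Number of distinct context documents.
`--ldp-context-length`	5000	Token length of each context document.
`--ldp-system-prompt-length`	1000	Token length of the shared system prompt. Use `0` for no system prompt.
`--ldp-num-permutations`	10	Number of distinct permutations to send, capped at N! (where N = `--ldp-num-contexts`).
`--ldp-num-inflight-requests`	1	Maximum number of concurrent in-flight requests.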
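The N! cap on `--ldp-num-permutations` can be illustrated with a short sketch. How the tool actually selects its permutations is not documented here, so the sampling strategy below (a seeded sample without replacement) and the function name are assumptions for illustration only. Enumerating every permutation is practical only for small N.

```python
import itertools
import math
import random


def sample_permutations(num_contexts, num_permutations, seed=42):
    """Pick distinct document orderings, capped at N! (illustrative only)."""
    cap = math.factorial(num_contexts)
    count = min(num_permutations, cap)  # cannot exceed the N! distinct orders
    all_orders = list(itertools.permutations(range(num_contexts)))
    return random.Random(seed).sample(all_orders, count)


def request_layout(order):
    """Render the [System Prompt] + [Doc_i1] + ... layout for one ordering."""
    return ["System Prompt"] + [f"Doc_{i}" for i in order]


orders = sample_permutations(num_contexts=3, num_permutations=10)
print(len(orders))  # 6, because 3! = 6 caps the requested 10
print(request_layout(orders[0]))
```

With `--ldp-num-contexts 4 --ldp-num-permutations 24`, as in the example that follows, every one of the 4! = 24 orderings is used exactly once.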

Example:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-permutator \
    --lmcache-url http://localhost:8080 \
    --ldp-num-contexts 4 \
    --ldp-context-length 8000 \
    --ldp-num-permutations 24 \
    --ldp-num-inflight-requests 2

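prefix-suffix-tuner

A two-pass sequential workload designed to run unmodified under three LMCache configurations, showing the value of each cache tier (L0 HBM, L1 DRAM, L2 disk):

Baseline	LMCache configuration	Tier to overflow	Expected pass-2 hits
1	Native vLLM (L0 only)	L0 (HBM)	None: every request is a cold prefill
2	vLLM + LMCache L1 + L2	L1 (DRAM)	L2 prefix hit (suffix recomputed)
3	vLLM + LMCache L1 + L2 + CacheBlend	L1 (DRAM)	L2 prefix hit + CacheBlend suffix hit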

Set --psf-thrash to the size of the tier you want to overflow (the L0 size for baseline 1, the L1 size for baselines 2 and 3). The workload itself is identical across baselines.

Per-request layout:

[prefix_i with unique-ID][random breaker][shared suffix]

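num_prefixes distinct prefixes, each starting with PREFIX_<8-hex> so that every prefix in the pool tokenizes and hashes differently.
A fresh random 32-token breaker in every request, which breaks the prefix boundary that ordinary prefix caching relies on.
A single suffix shared by all requests; this is the only piece CacheBlend can reuse.

Pass 1 (warmup) sends each prefix once to populate the cache, and its statistics are discarded. Pass 2 sends them again in the same order. Because LRU evicts the next-needed prefix on every pass-2 access, overflowing the target tier by as little as 1.05x is enough to make every pass-2 request miss in that tier and fall through to the next one.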
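The LRU argument above can be checked with a small simulation. This is a sketch of a textbook LRU cache over equal-sized entries, not LMCache's eviction code; `pass2_hits` and its parameters are hypothetical names.

```python
import math
from collections import OrderedDict


def pass2_hits(capacity, overflow=1.05):
    """Replay the two sequential passes against a plain LRU cache.

    Returns (hits in pass 2, pool size). Entries are treated as
    equal-sized units, which is a simplifying assumption.
    """
    pool = math.ceil(capacity * overflow)
    cache = OrderedDict()

    def access(key):
        hit = key in cache
        if hit:
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
        return hit

    for key in range(pool):  # pass 1: warmup, results discarded
        access(key)
    hits = sum(access(key) for key in range(pool))  # pass 2: same order
    return hits, pool


hits, pool = pass2_hits(capacity=100)
print(f"pool={pool} pass-2 hits={hits}")
```

With the pool even slightly larger than the capacity, every pass-2 access misses, because each miss evicts exactly the entry that will be requested soonest. Setting `overflow=1.0` makes the pool fit, and every pass-2 access then hits.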

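Flag	Default	Description
`--psf-context-length`	8000	Total tokens per request (prefix + breaker + suffix).
`--psf-prefix-ratio`	0.8	Fraction of the context length taken by the prefix; must lie in the open interval (0.0, 1.0). The remainder, minus the 32-token breaker, is the shared suffix.
`--psf-thrash`	20.0	Size in GB of the KV cache tier to overflow. Use the L0 (HBM) size for native vLLM and the L1 (LMCache DRAM) size for the tiered baselines. The workload sizes its prefix pool slightly larger than this value (an internal 5% overflow), which is enough under sequential dispatch + LRU to make every pass-2 request miss in that tier.

The number of pass-2 (measured) requests equals the prefix pool size, computed as floor(psf_thrash * 1.05 * tokens_per_gb / prefix_tokens). This workload does not use --kv-cache-volume; its scale is determined entirely by --psf-thrash.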
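As a worked instance of the pool-size formula, the sketch below plugs in the documented defaults. It assumes that prefix_tokens is the context length multiplied by the prefix ratio and truncated to an integer; the tool's exact rounding is not stated here, and the function name is hypothetical.

```python
import math

BREAKER_TOKENS = 32  # per-request random breaker, as documented above


def psf_sizing(psf_thrash_gb, tokens_per_gb, context_length=8000, prefix_ratio=0.8):
    """Return (prefix_tokens, suffix_tokens, pool_size) for the workload.

    Assumption: prefix_tokens = int(context_length * prefix_ratio).
    """
    prefix_tokens = int(context_length * prefix_ratio)
    suffix_tokens = context_length - prefix_tokens - BREAKER_TOKENS
    pool_size = math.floor(psf_thrash_gb * 1.05 * tokens_per_gb / prefix_tokens)
    return prefix_tokens, suffix_tokens, pool_size


prefix, suffix, pool = psf_sizing(psf_thrash_gb=20.0, tokens_per_gb=6553)
print(prefix, suffix, pool)
```

With `--psf-thrash 20` and 6553 tokens per GB (the value used in the sample config on this page), the 6400-token prefixes give a pool of 21, so pass 2 issues 21 measured requests.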

Example:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload prefix-suffix-tuner \
    --lmcache-url http://localhost:8080 \
    --psf-context-length 8000 \
    --psf-prefix-ratio 0.8 \
    --psf-thrash 100

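Note

For the analytical model's claim "thrash ≈ L1 size → ~0% LMCache hit rate" to hold experimentally, the LMCache server must be started with --eviction-ratio 0.99 (the default of 0.20 clears only 20% per cycle, which leaves roughly 60% of the pass-1 content in the cache during pass 2):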

lmcache server --l1-size-gb <SIZE> --eviction-policy LRU \
    --eviction-trigger-watermark 0.80 \
    --eviction-ratio 0.99

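The workload sleeps for 5 seconds between pass 1 (warmup) and pass 2 (measurement) so that LMCache's 1 Hz batched eviction polling thread has a chance to actually run. Without this sleep, a fast benchmark would finish before any eviction fires.

random-prefill

Sends all requests simultaneously with max_tokens=1 to measure pure prefill performance. There is no warmup phase.

Flag	Default	Description
`--rp-request-length`	10000	Token length of each prefill request.
`--rp-num-requests`	50	Total number of requests to send.

Example: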

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload random-prefill \
    --lmcache-url http://localhost:8080 \
    --rp-request-length 15000 \
    --rp-num-requests 100

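Interactive mode

When --engine-url or --workload is not provided (and --no-interactive is not set), the tool enters interactive mode and walks you through four stages:

Required settings -- engine URL, workload type, LMCache server (or tokens per GB).
General settings (optional) -- model name, KV cache volume.
Workload settings (optional) -- workload-specific parameters.
Summary and action -- review the configuration, then start the benchmark or export the configuration to a JSON file.

Each prompt focuses on a single setting. Selection prompts are driven with the arrow keys; text and numeric prompts accept typed input, with the default shown in brackets.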

══════════════════════════════════════════════════
 lmcache bench engine -- Interactive Setup
══════════════════════════════════════════════════

Engine URL
  URL of the inference engine.
  [default: http://localhost:8000] >

Workload
  The type of benchmark workload to run.
  Use up/down to navigate, Enter to select.

  * long-doc-qa           Repeated Q&A over long documents
    multi-round-chat       Multi-turn chat with stateful sessions
    long-doc-permutator    Permutations of context documents
    prefix-suffix-tuner    Two-pass tiered KV-cache demonstrator
    random-prefill         Prefill-only requests fired simultaneously

LMCache Server
  Do you have a running LMCache server?
  It can auto-detect KV cache size information.
  [default: Y] (Y/n) >

...

──────────────────────────────────────────────────
 Configuration Summary
──────────────────────────────────────────────────
  Workload:             long-doc-qa
  Model:                Qwen/Qwen3-14B
  Tokens per GB:        6553
  ...
──────────────────────────────────────────────────

What would you like to do?
  * Start benchmark
    Export configuration for later use and exit

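When you choose "Export configuration", all auto-detected values (model name, tokens per GB) are resolved and saved to a portable JSON file.

Config files

Config files store the benchmark parameters but not environment-specific values (such as the engine URL or LMCache URL), so the same config can be reused across environments.

There are three ways to create a config file:

Interactive mode -- choose "Export configuration" at the summary step.
`--export-config` -- resolve and export from the CLI without running.
By hand -- write JSON whose keys match the CLI flag names (with hyphens replaced by underscores).

Example config file: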

{
  "model": "Qwen/Qwen3-14B",
  "workload": "long-doc-qa",
  "tokens_per_gb_kvcache": 6553,
  "kv_cache_volume": 100.0,
  "ldqa_document_length": 10000,
  "ldqa_query_per_document": 2,
  "ldqa_shuffle_policy": "random",
  "ldqa_num_inflight_requests": 3
}

Load the config file with --config (the engine URL is supplied separately):

lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

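CLI arguments override values from the config file, so you can treat a base config as a template and adjust individual settings as needed: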

# Use saved config but override KV cache volume
lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json --kv-cache-volume 200
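When writing a config file by hand, the flag-to-key rule (drop the leading dashes, replace hyphens with underscores) is easy to automate. The helper below is a hypothetical sketch, not an LMCache API; it also leaves out the two environment-specific URLs, which this page says are never stored in config files.

```python
import json

# Environment-specific settings that do not belong in a config file.
ENV_SPECIFIC = {"engine_url", "lmcache_url"}


def flags_to_config(flags):
    """Map CLI-style flag names to config-file keys (hypothetical helper)."""
    config = {}
    for flag, value in flags.items():
        key = flag.lstrip("-").replace("-", "_")
        if key not in ENV_SPECIFIC:
            config[key] = value
    return config


config = flags_to_config({
    "--engine-url": "http://localhost:8000",  # dropped: environment-specific
    "--workload": "long-doc-qa",
    "--tokens-per-gb-kvcache": 6553,
    "--kv-cache-volume": 100.0,
    "--ldqa-document-length": 10000,
})
print(json.dumps(config, indent=2))
```

The printed JSON uses the same key names as the example config file above and can be saved as `my_bench.json`.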

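Output

Terminal (live progress)

While the benchmark runs, a live progress panel shows the number of in-flight requests, mean TTFT, decoding speed, and throughput. Suppress it with -q.

Terminal (final summary)

On completion, a summary table is printed: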

======= Engine Benchmark Result (long-doc-qa) ========
---------------------- Configuration ------------------
Engine URL:                       http://localhost:8000
Model:                            Qwen/Qwen3-14B
Workload:                         long-doc-qa
------------------------- Results ---------------------
Successful requests:              20
Failed requests:                  0
Benchmark duration (s):           31.34
Total input tokens:               200000
Total output tokens:              2560
Input throughput (tok/s):         6381.62
Output throughput (tok/s):        81.69
--------------- Time to First Token -------------------
Mean TTFT (ms):                   313.41
P50 TTFT (ms):                    272.83
P90 TTFT (ms):                    587.21
P99 TTFT (ms):                    837.32
------------------ Decoding Speed ---------------------
Mean decode (tok/s):              48.23
P99 decode (tok/s):               38.55
======================================================

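CSV and JSON

bench_results.csv -- per-request metrics (TTFT, latency, decoding speed, token counts). Written by default; skip it with --no-csv.
bench_summary.json -- aggregate statistics, including percentiles and configuration metadata. Enabled with --json.

Both files are written to the directory given by --output-dir (default: current directory).

Exit codes

Code	Meaning
`0`	All requests succeeded.
`1`	One or more requests failed.

server

The lmcache bench server command is an end-to-end sanity test for an LMCache cache server running in MP (multiprocess) mode. It connects to a running server over ZMQ, drives the full KV cache data path for a series of synthetic requests, and can optionally verify per-chunk checksums through the HTTP API.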

lmcache bench server [options]

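Unlike lmcache bench engine, this command does not need an inference engine, only a running LMCache MP server (ZMQ + HTTP). GPU mode additionally requires a CUDA-capable device. A full lmcache install is also required (not the lightweight lmcache-cli package).

What it does

For each sequence in the range [--start, --end), the tool runs two passes:

Cold pass -- LOOKUP is expected to miss, so the generated KV tensors are written to the server with STORE.
Warm pass -- LOOKUP is expected to hit; the tool issues RETRIEVE and compares the checksums of the retrieved KV chunks against the originals.

The full RPC path is: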

REGISTER_KV_CACHE → GET_CHUNK_SIZE → LOOKUP
  → QUERY_PREFETCH_STATUS → RETRIEVE → STORE
  → END_SESSION

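When --url points at the server's HTTP endpoint, the per-chunk checksums are also cross-checked against the values computed on the server side, and any disagreement between producer and consumer surfaces as a prominent CHECKSUM MISMATCH log line.

Quick start

Start the MP server in one terminal: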

lmcache server \
    --host localhost --port 15556 \
    --chunk-size 256 --l1-size-gb 5 \
    --eviction-policy LRU --max-workers 1

Then, in another terminal:

lmcache bench server \
    --rpc-url tcp://localhost:15556 \
    --url http://localhost:8080

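By default the tool runs indefinitely (--end is unset); stop it at any time with Ctrl-C. Pass --end N to bound the run (N is an exclusive upper bound on the sequence number).

Options

Flag	Default	Description
`--rpc-url URL`	`tcp://localhost:5555`	ZMQ endpoint of the MP-mode cache server.
`--url URL`	`http://localhost:8080`	HTTP base URL of the server's checksum API, used for end-to-end per-chunk checksum verification.
`--mode {gpu,cpu}`	`gpu`	Run mode. `gpu` allocates real CUDA tensors and uses CUDA IPC (the lmcache-driven handle path). `cpu` allocates POSIX-SHM-backed tensors and, by default, uses the engine-driven (worker-side gather/scatter) path.
`--transfer-mode {auto,engine_driven,lmcache_driven}`	`auto`	Transfer routing for STORE/RETRIEVE. `lmcache_driven` forces the handle path (`REGISTER_KV_CACHE` + `STORE`/`RETRIEVE`), which supports zero-copy transfer over CUDA IPC and CPU SHM. `engine_driven` forces the worker-side gather/scatter path (`REGISTER_KV_CACHE_ENGINE_DRIVEN_CONTEXT` + `PREPARE`/`COMMIT`). `auto` maps `gpu` to `lmcache_driven` and `cpu` to `engine_driven`.
`--num-tokens N`	`512`	Tokens per synthetic request.
`--num-blocks N`	`1024`	Number of paged KV blocks to allocate on the GPU.
`--block-size N`	`16`	Tokens per paged block.
`--start N`	`0`	First sequence number to run.
`--end N`	(unset)	Exclusive upper bound on the sequence number. When omitted, the loop runs indefinitely.
`--interval SECS`	`0.5`	Delay in seconds between consecutive sub-passes.
`--kvcache-shape-spec SPEC`	`(2,1024,16,8,128):float16:32`	KV cache shape spec (see below).
`--format FORMAT`	`terminal`	Stdout format for the final metrics summary. Available formats: `terminal`, `json`.
`--output PATH`	(unset)	Save the final metrics summary to the file at PATH (format selected by `--format`).
`-q` / `--quiet`	(unset)	Suppress all progress messages during the run. Only the final structured metrics summary is emitted (unless it is also redirected with `--output`).

CPU mode (no GPU)

--mode cpu runs the same end-to-end path without a GPU. The server runs on a CPU-only host (StubCPUDevice); the benchmark tool allocates POSIX-SHM-backed KV tensors and exercises the full RPC path.

By default, --mode cpu uses the engine-driven gather/scatter path (auto maps cpu to engine_driven). To use the zero-copy SHM handle path instead, pass --transfer-mode lmcache_driven: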

# Terminal 1 -- start the LMCache server (no GPU required)
lmcache server \
    --host localhost --port 5555 \
    --l1-size-gb 2 --eviction-policy LRU

# Terminal 2 -- run bench in CPU + lmcache_driven mode
lmcache bench server \
    --rpc-url tcp://localhost:5555 \
    --url http://localhost:8080 \
    --mode cpu --transfer-mode lmcache_driven \
    --start 0 --end 2

KV cache shape spec

The --kvcache-shape-spec argument describes the layout of the KV tensors on the GPU. A spec consists of one or more groups separated by ;:

(kv_size,NB,BS,NH,HS):dtype:layers[;(...):dtype:layers...]

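Fields:

kv_size -- 2 for classic attention (separate K/V), 1 for MLA.
NB -- number of paged blocks.
BS -- block size (tokens per block).
NH -- number of attention heads per layer.
HS -- attention head size (in elements).
dtype -- element data type (for example float16, bfloat16, float32, uint8); the full set matches the keys of DTYPE_MAP in lmcache/v1/kv_layer_groups.py.
layers -- number of layers in the group.

Multi-group specs model heterogeneous layers (for example, a model that contains both MLA layers and classic attention layers):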

lmcache bench server \
    --rpc-url tcp://localhost:15556 \
    --kvcache-shape-spec "(1,1024,16,1,128):float16:4;(2,1024,16,8,128):float16:28"

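All groups must share the same NB and BS (a physical constraint of paged KV), and the layer counts of all groups sum to the total number of layers registered with the server.

For the full parsing rules and validation errors, see parse_kvcache_shape_spec in lmcache/v1/kv_layer_groups.py.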
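To make the spec grammar concrete, here is a minimal standalone parser. It is an illustrative sketch written from the field descriptions above, not the library's parse_kvcache_shape_spec; the names `LayerGroup`, `parse_shape_spec`, and the `DTYPE_BYTES` table (a four-entry subset) are assumptions.

```python
import re
from dataclasses import dataclass

# Illustrative subset; the real set is DTYPE_MAP in lmcache/v1/kv_layer_groups.py.
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4, "uint8": 1}

GROUP_RE = re.compile(r"^\((\d+),(\d+),(\d+),(\d+),(\d+)\):(\w+):(\d+)$")


@dataclass(frozen=True)
class LayerGroup:
    kv_size: int
    num_blocks: int
    block_size: int
    num_heads: int
    head_size: int
    dtype: str
    layers: int

    def nbytes(self):
        """Total bytes for all layers in this group."""
        per_layer = (self.kv_size * self.num_blocks * self.block_size
                     * self.num_heads * self.head_size * DTYPE_BYTES[self.dtype])
        return per_layer * self.layers


def parse_shape_spec(spec):
    """Parse '(kv_size,NB,BS,NH,HS):dtype:layers[;...]' into LayerGroups."""
    groups = []
    for part in spec.split(";"):
        match = GROUP_RE.match(part.strip())
        if match is None:
            raise ValueError(f"malformed group: {part!r}")
        *dims, dtype, layers = match.groups()
        if dtype not in DTYPE_BYTES:
            raise ValueError(f"unsupported dtype in this sketch: {dtype}")
        groups.append(LayerGroup(*map(int, dims), dtype, int(layers)))
    # All groups must share NB and BS.
    if len({(g.num_blocks, g.block_size) for g in groups}) > 1:
        raise ValueError("all groups must share the same NB and BS")
    return groups


default_groups = parse_shape_spec("(2,1024,16,8,128):float16:32")
print(sum(g.nbytes() for g in default_groups) / 2**30, "GiB")
```

Under these assumptions the default spec works out to 2 GiB of KV tensors (64 MiB per layer across 32 layers), which is a useful sanity check before raising --num-blocks.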

Profiling the server

lmcache bench server is a ZMQ client: the store path it exercises (hashing, allocation, gather, D2H) runs inside the server process, not this benchmark. --flamegraph on therefore attaches the profiler to a server pid you supply, records for the duration of the load, and renders a flame graph of the server, not of the client.

lmcache bench server \
    --rpc-url tcp://localhost:5555 \
    --start 0 --end 200 --interval 0.02 \
    --flamegraph on --flamegraph-mode gil \
    --profile-server-pid "$(pgrep -f 'lmcache server')"

--flamegraph-mode takes the same six values documented under lmcache tool flamegraph (or several comma-separated to drive the load once per mode, one SVG each). Because the target is a separate, already-running server (not a process this benchmark spawns), it profiles by attaching, so the same attach-mode caveats documented for lmcache tool flamegraph apply here: what each mode shows, the PYTHONPERFSUPPORT=1 requirement for naming Python frames in the perf/bcc modes, the container privileges each mode needs, and the fact that recording a live process is never free.

Note

The one thing unique to bench server: it records while it drives load, so the recording overhead lands on the very throughput/latency this benchmark reports. Keep the profiled run short and read those numbers as indicative, not a clean baseline.

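Output

When the run completes (or is interrupted with Ctrl-C), a structured metrics summary is printed. It contains:

Configuration -- RPC URL, mode, transfer mode, tokens per request, interval.
Results -- total requests, checksum OK / FAIL counts, pass rate.
Latency sections -- per-operation latency statistics (count, mean, min, max, p50, p99) for cold lookup, cold store, warm lookup, and warm retrieve.

Use --format json for machine-readable output, or --output FILE to save the summary to a file.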

================ Server Bench Result =================
---------------------- Configuration -----------------
RPC URL:                          tcp://localhost:15556
Mode:                             gpu
Transfer mode:                    auto
Tokens / request:                 512
Interval (s):                     0.5
------------------------- Results --------------------
Total requests:                   3
Checksum OK:                      3
Checksum FAIL:                    0
Pass rate (%):                    100.0
-------------------- Cold Lookup (ms) ---------------
count:                            3
mean:                             1.647
min:                              1.312
max:                              1.823
p50:                              1.647
p99:                              1.823
--------------------- Cold Store (ms) ---------------
count:                            3
mean:                             1.740
min:                              1.521
max:                              1.982
p50:                              1.740
p99:                              1.982
-------------------- Warm Lookup (ms) ---------------
count:                            3
mean:                             1.310
min:                              1.102
max:                              1.512
p50:                              1.310
p99:                              1.512
------------------- Warm Retrieve (ms) --------------
count:                            3
mean:                             1.480
min:                              1.321
max:                              1.612
p50:                              1.480
p99:                              1.612
=====================================================

Example output (progress)

During the run, progress messages are printed to stdout (suppress them with -q / --quiet):

Connecting to LMCache MP Server at tcp://localhost:15556 (mode=gpu) ...
Server chunk_size = 256
Resolved KV shape spec: (2,1024,16,8,128):float16:32
[seq=0] LOOKUP cold:  0/2 chunks hit (1.82 ms)
[seq=0] STORE:        2 chunks stored (1.74 ms)
[seq=0] LOOKUP warm:  2/2 chunks hit (1.31 ms)
[seq=0] RETRIEVE:     2 chunks retrieved (1.48 ms)
[seq=0] CHECKSUM MATCH OK
[seq=1] ...

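Any CHECKSUM MISMATCH, ERROR, or Python traceback in the log indicates a real problem worth investigating.

Exit codes

Code	Meaning
`0`	The test loop completed normally (or was interrupted cleanly with Ctrl-C) with no checksum mismatches.
`1`	Fatal error (for example, CUDA unavailable with `--mode gpu`, server unreachable, or a checksum mismatch).

l2

The lmcache bench l2 command benchmarks an L2 cache adapter (for example, the local filesystem adapter) end to end, going through the same parse_args_to_l2_adapters_config + create_l2_adapter pipeline that LMCache uses in production. Any registered adapter type can be tested without code changes: describe the adapter with a JSON spec and choose which operations to run.

lmcache bench l2 [options]

Unlike lmcache bench engine, this command needs neither an inference engine nor an LMCache MP server, only that the adapter's backing store is reachable (for the fs adapter, a writable directory is enough).

What it does

For each operation under test, the tool drives the adapter directly through its public submit/wait API:

Store -- submit_store_task writes num_keys MemoryObj instances per submit and waits on the store eventfd.
Lookup -- submit_lookup_and_lock_task checks whether keys exist (no payload transfer) and waits on the lookup eventfd.
Load -- submit_load_task reads num_keys MemoryObj instances per submit and waits on the load eventfd.

Each measured round issues --in-flight submits sequentially from a single producer thread and then waits for all of them to complete; the round duration is the wall-clock time from the first submit to the last completion. Warmup rounds run before measurement and are excluded from the final summary.

All three operations share the same key index space, so running --only store followed by --only load (or --only lookup) with otherwise identical arguments touches exactly the same keys. This lets the benchmark double as a quick regression test that an adapter supports a full store -> load round trip.

Note

When --only is not given, the three operations run sequentially in a single process: store -> lookup -> load. For adapters whose backing store sits behind an OS-level cache, most notably the local filesystem (fs) adapter, which is subject to the Linux page cache, lookup and load almost always find the data that store just wrote still resident in RAM, so the reported numbers reflect page-cache throughput rather than the throughput of the underlying device.

To benchmark each operation under cold-cache conditions, run them separately with --only and drop the OS cache between runs, for example: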

lmcache bench l2 --l2-adapter '...' --only store
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
lmcache bench l2 --l2-adapter '...' --only lookup
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
lmcache bench l2 --l2-adapter '...' --only load

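For adapters that bypass the page cache (for example, an fs adapter with "use_odirect": true) or that talk to a remote service with no local cache, the default combined run is fine as is.

O_DIRECT adapters may also require the benchmark's L1 buffers to satisfy the adapter's block alignment. Set the alignment with --l1-align-bytes; 4096 is typical for local block devices. The payload size (--data-size-kb * 1024) must be an integer multiple of the alignment.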
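The alignment rule can be checked before launching a run. The helper below is a hypothetical pre-flight check that simply restates the constraint above; it is not how the tool itself validates its arguments.

```python
def payload_is_aligned(data_size_kb, l1_align_bytes):
    """True when --data-size-kb * 1024 is a multiple of --l1-align-bytes."""
    if l1_align_bytes <= 0:
        raise ValueError("alignment must be a positive number of bytes")
    payload_bytes = data_size_kb * 1024
    return payload_bytes % l1_align_bytes == 0


for size_kb in (1024, 256, 3):
    print(size_kb, payload_is_aligned(size_kb, 4096))
```

With a 4096-byte alignment, any --data-size-kb that is a multiple of 4 passes; a 3 KiB payload (3072 bytes) does not.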

Quick start

Benchmark the local filesystem adapter with default arguments:

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/tmp/lmcache-bench"}'

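This runs all three operations (store, lookup, load), with one warmup round and one measured round.

Stress the adapter with more concurrent submits and larger payloads: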

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/data/lmcache-bench","relative_tmp_dir":"tmp"}' \
    --num-keys 32 --in-flight 4 \
    --data-size-kb 512 \
    --rounds 5 --warmup-rounds 1

Benchmark an O_DIRECT adapter with aligned L1 buffers:

lmcache bench l2 \
    --l2-adapter '{"type":"raw_block","device_path":"/dev/nvme0n1","slot_bytes":4194304,"use_odirect":true,"block_align":4096}' \
    --data-size-kb 1024 \
    --l1-align-bytes 4096

Run a single operation (to measure store and load throughput separately):

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/tmp/lmcache-bench"}' \
    --only store

Run lookups with a controlled hit rate (the benchmark splits the lookup keys into a may-exist range and a guaranteed-absent range):

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/tmp/lmcache-bench"}' \
    --only lookup --lookup-max-hit-rate 0.5

Enable the store -> load round-trip data integrity check on the last measured round:

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/tmp/lmcache-bench"}' \
    --no-skip-verify

To keep the JSON spec off the command line, set the L2_ADAPTER_JSON environment variable instead of passing --l2-adapter:

export L2_ADAPTER_JSON='{"type":"fs","base_path":"/tmp/lmcache-bench"}'
lmcache bench l2 --num-keys 32 --in-flight 2

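Options

Flag	Default	Description
`--l2-adapter JSON`	(unset)	L2 adapter spec as JSON, containing a `"type"` field plus adapter-specific settings, for example `'{"type":"fs","base_path":"/tmp/bench"}'`. May be passed multiple times; only the first spec is benchmarked. Falls back to the `L2_ADAPTER_JSON` environment variable when not provided. One of the two must be specified.
`--num-keys N`	`32`	Keys per submit.
`--in-flight N`	`1`	In-flight submits per round. Each round issues this many submits in order from a single producer thread, then waits for all of them to complete.
`--data-size-kb N`	`256`	Data size per key, in KiB.
`--l1-align-bytes N`	`1`	Alignment in bytes of the benchmark's L1 buffers. When testing an O_DIRECT backend, this should be at least the adapter's block alignment; `4096` is typical for local block devices. `--data-size-kb * 1024` must be an integer multiple of this value.
`--rounds N`	`1`	Measured rounds per operation.
`--warmup-rounds N`	`1`	Warmup rounds run before measurement; their results are excluded from the statistics.
`--lookup-max-hit-rate F`	`0.0`	Upper bound on the lookup hit rate, in `[0, 1]`. The benchmark requests `floor(N * rate)` keys from the may-exist range and `N - hit` keys from the guaranteed-absent range, where `N` is the total number of lookup keys and `hit` is the first count. The actual hit rate may be lower if those keys were never stored in this run.
`--skip-verify` / `--no-skip-verify`	`--skip-verify`	Skip the store -> load round-trip data integrity check (the default). Pass `--no-skip-verify` to enable the check on the last measured round, which requires both `store` and `load` to run.
`--only {lookup,store,load}`	(unset)	Run only the given operation. When omitted, all three run in the order `store -> lookup -> load`.
`--flamegraph {on,off}`	`off`	Capture a flame graph of the measured phases (`on`) or run the benchmark normally (`off`). With `on`, the benchmark profiles itself and renders an SVG. The default `off` leaves the benchmark's behavior unchanged. See Profiling / flame charts.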
`--flamegraph-mode {on-cpu,off-cpu,wakeup,offwake,wall,gil}`	`on-cpu`	Flame-graph mode for `--flamegraph on`. `on-cpu` shows where CPU time goes; `off-cpu` shows time blocked on I/O / locks (best for I/O-bound adapters); `offwake` adds the waker stack to each blocked stack; `wakeup` shows the stacks doing the waking. `wall` and `gil` (`py-spy`) split the chart per thread: wall-clock time, and time holding the interpreter lock.
`--flamegraph-output PATH`	(auto)	SVG output path. Default: `/tmp/lmcache_bench_flames/<adapter>.<mode>.svg`.
`--flamegraph-scripts-dir DIR`	(~/FlameGraph)	Directory containing the FlameGraph scripts (`flamegraph.pl`, `stackcollapse-perf.pl`).

Adapter JSON spec

The --l2-adapter JSON is parsed by lmcache.v1.distributed.l2_adapters.config.parse_args_to_l2_adapters_config, the same entry point LMCache uses elsewhere. The only required field is type; all other fields are forwarded to the adapter implementation as keyword arguments.

Example for the local filesystem adapter:

{
  "type": "fs",
  "base_path": "/data/lmcache-bench",
  "relative_tmp_dir": "tmp",
  "read_ahead_size": null,
  "use_odirect": false
}

For the full list of adapter types and the fields each supports, see the source under lmcache/v1/distributed/l2_adapters/.

Example output

Per-round progress (suppress with -q):

============================================================
L2 Adapter Benchmark
============================================================
  Adapter config         : FSL2AdapterConfig
  L2 adapter JSON        : {"type":"fs","base_path":"/data/lmcache-bench","relative_tmp_dir":"tmp"}
  Keys / submit          : 32
  In-flight / round      : 3
  Keys / round           : 96
  Data size / key        : 256 KB
  Data / round           : 24.00 MB
  Rounds                 : 1 (+ 1 warmup)
  Lookup max hit rate    : 0.00%
============================================================

[Init] Creating adapter...
[Init] Adapter created successfully (FSL2Adapter).

[Store] Running 1 warmup + 1 measurement rounds...
  [Store] Round 1: 47.83 ms, success_keys=96/96
  [Store] Round 2: 46.19 ms, success_keys=96/96

[Lookup] Running 1 warmup + 1 measurement rounds...
  [Lookup] Round 1:  5.36 ms, found=96/96
  [Lookup] Round 2:  5.03 ms, found=96/96

[Load] Running 1 warmup + 1 measurement rounds...
  [Load] Round 1: 18.15 ms, loaded=96/96
  [Load] Round 2: 17.63 ms, loaded=96/96

Final summary (one section per operation):

====== L2 Adapter Benchmark Result (FSL2Adapter) =======
----------------------- Configuration -------------------
Adapter:                          FSL2Adapter
Keys / submit:                    32
In-flight / round:                3
Data size / key (KB):             256
Measurement rounds:               1
Warmup rounds:                    1
Lookup max hit rate:              0.0
--------------------------- Store -----------------------
Operation:                        Store
Rounds:                           1
Keys / round:                     96
Total keys:                       96
Total success:                    96
Duration avg (ms):                46.19
...
Throughput avg (MB/s):            519.62
Avg ops/s:                        2078.50
Avg latency / key (ms):           0.481
--------------------------- Lookup ----------------------
...
---------------------------- Load -----------------------
...
=========================================================

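Each operation section reports per-round duration statistics (avg / min / max / p50 / p99 / std), aggregate throughput (avg_throughput_mbps, which is 0 for Lookup because no payload is transferred), the average key rate (avg_ops_per_sec), and the per-key latency.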
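The derived figures in the sample Store section can be approximately reproduced from the configuration and the round duration. The definitions below are assumptions inferred from the sample output (1 MB taken as 1024 KB, rates computed over a single round), not the tool's source; `round_metrics` is a hypothetical name.

```python
def round_metrics(num_keys, in_flight, data_size_kb, duration_ms):
    """Derive per-round figures from the benchmark settings (assumed formulas)."""
    keys_per_round = num_keys * in_flight
    data_mb = keys_per_round * data_size_kb / 1024  # assumes 1 MB = 1024 KB
    seconds = duration_ms / 1000
    return {
        "keys_per_round": keys_per_round,
        "data_mb": data_mb,
        "throughput_mbps": data_mb / seconds,
        "ops_per_sec": keys_per_round / seconds,
        "latency_per_key_ms": duration_ms / keys_per_round,
    }


# Values taken from the sample Store section above.
store = round_metrics(num_keys=32, in_flight=3, data_size_kb=256, duration_ms=46.19)
for name, value in store.items():
    print(f"{name}: {value:.3f}")
```

The results land within rounding distance of the sample (about 519.6 MB/s, 2078 ops/s, and 0.481 ms per key); the small differences are consistent with the displayed duration having been rounded to two decimals.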

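For Lookup, three additional fields are reported when --lookup-max-hit-rate is nonzero or any key hit:

Expected max hit rate -- the configured upper bound on the hit rate.
Expected hit keys -- floor(total_keys * rate), scaled over measured rounds only.
Actual hit rate -- the hit rate observed across the measured rounds.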
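The key split behind these fields is plain arithmetic. The sketch below applies the documented floor(N * rate) rule; the function name is hypothetical, and the real benchmark additionally maps each count onto concrete key ranges.

```python
import math


def split_lookup_keys(total_keys, max_hit_rate):
    """Return (may_exist_keys, guaranteed_absent_keys) for a lookup round."""
    if not 0.0 <= max_hit_rate <= 1.0:
        raise ValueError("max_hit_rate must be in [0, 1]")
    may_exist = math.floor(total_keys * max_hit_rate)
    return may_exist, total_keys - may_exist


print(split_lookup_keys(96, 0.5))   # (48, 48)
print(split_lookup_keys(10, 0.33))  # (3, 7)
```

The first number is only an upper bound on hits: keys in the may-exist range still miss if they were never stored in the run.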

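Round-trip verification

When --no-skip-verify is passed and both store and load ran, the benchmark compares the load buffers of the last measured round against the byte pattern written by store (see make_memory_objects in lmcache/cli/commands/bench/l2_adapter_bench/data.py):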

[Verify] Checking store -> load data integrity for last measured round...
[Verify] OK

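Verification is off by default because the stricter byte pattern requires both the store and load object batches to stay resident in memory so that the loaded data can be compared against the original store pattern.

Profiling / flame charts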

When --flamegraph on is passed, the benchmark profiles its own process (the L2 adapter driven by this microbenchmark's synthetic load) and renders a flame graph of the measured phases (to profile a separate server or a real process instead, use lmcache tool flamegraph):

lmcache bench l2 \
    --l2-adapter '{"type":"fs","base_path":"/data/lmcache-bench"}' \
    --rounds 300 --flamegraph on --flamegraph-mode on-cpu
#   [Profile] on-cpu recording started (pid=12345) -> .../FSL2Adapter.oncpu.svg
#   [Profile] wrote /tmp/lmcache_bench_flames/FSL2Adapter.oncpu.svg

The --flamegraph-mode values, cost of recording, and tool / sysctl requirements are documented under lmcache tool flamegraph (or pass several comma-separated to profile one benchmark run per mode, one SVG each). What is specific to bench l2:

It self-profiles, so on CPython 3.12+ it activates the perf trampolines itself and adapter functions resolve as py::<qualname> in the on-cpu / off-cpu charts with no PYTHONPERFSUPPORT needed (an attached server cannot). Trampolines cost a few percent, so treat a profiled run's timings as indicative.
The recorder runs as a child of the benchmark, so wall / gil need kernel.yama.ptrace_scope at 0 (not the attach-mode permissions).
Recording covers only the measured work, so use a large --rounds; too short a run captures no samples.

The SVG is written to --flamegraph-output (default /tmp/lmcache_bench_flames/<adapter>.<mode>.svg).

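Exit codes

Code	Meaning
`0`	All requested operations completed and, when enabled, round-trip verification passed.
`1`	Adapter creation failed, round-trip verification failed, or an operation hit a fatal error (for example, every round timed out).
`2`	Invalid invocation: the `--l2-adapter` JSON / `L2_ADAPTER_JSON` environment variable is missing or unparseable, an option value is invalid, or `--flamegraph on` was requested but the profiling toolchain is unavailable.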
