lmcache describe#

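The lmcache describe command shows the detailed status of a running service. It supports two targets:

kvcache: the LMCache KV Cache service (health, L1 storage, registered models, L2 adapters).
engine: the inference engine (vLLM) paired with LMCache (model, context window, health, in-flight requests).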

KV Cache service (`kvcache`)#

lmcache describe kvcache --url http://localhost:8000

============ LMCache KV Cache Service ============
Health:                                         OK
URL:                         http://localhost:8000
Engine type:                           BlendEngine
Chunk size:                                    256
L1 capacity (GB):                            60.00
L1 used (GB):                        42.30 (70.5%)
Eviction policy:                               LRU
Cached objects:                               1024
Active sessions:                                 3
---- Model: meta-llama/Llama-3.1-70B-Instruct ----
Model:           meta-llama/Llama-3.1-70B-Instruct
World size:                                      4
GPU IDs:                                0, 1, 2, 3
Num layers:                                     80
Num blocks:                                   2048
Cache size per token (bytes):               327680
--- Kernel group 0 (meta-llama/Llama-3.1-70B-Instruct) ---
Kernel group index:                              0
Engine group index:                              0
Object group index:                              0
Num layers:                                     80
Slots per block:                               128
Dtype:                               torch.float16
MLA:                                         False
Attention backend:    vLLM non-MLA flash attention
Engine KV shape:          NL x [2, NB, BS, NH, HS]
Engine KV tensor shape: 80 x [2, 2048, 128, 8, 128]
------------- L2: NixlStoreL2Adapter -------------
Type:                           NixlStoreL2Adapter
Health:                                         OK
Backend:                                 nixl_rdma
Stored objects:                                512
Pool used:                       480 / 512 (93.8%)
==================================================

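The output shows:

Overview: health status, engine type, chunk size.
L1 storage: capacity, usage, eviction policy, number of cached objects.
Registered models: the KV cache layout of each model, given as one context-wide summary followed by one section per kernel group. Each kernel-group section lists the engine KV tensor shape (symbolic and concrete), the attention backend, and the group geometry.
L2 adapters: type, health, backend, stored objects, and utilization.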
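As a sanity check on the sample output, the reported cache size per token is consistent with the kernel-group geometry. The sketch below assumes each token stores one key vector and one value vector per layer and per head; it is an illustration of the arithmetic, not LMCache's own computation, and the function name is hypothetical.

```python
def cache_bytes_per_token(num_layers, num_heads, head_size, dtype_bytes):
    """Bytes of KV cache for one token, assuming K and V per layer and head."""
    kv = 2  # one key vector plus one value vector (assumption)
    return num_layers * kv * num_heads * head_size * dtype_bytes

# Values from the sample output: NL=80, NH=8, HS=128, torch.float16 (2 bytes).
per_token = cache_bytes_per_token(80, 8, 128, 2)
print(per_token)  # 327680, matching "Cache size per token (bytes)"
```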

Inference engine (`engine`)#

describe engine inspects the vLLM inference engine rather than the LMCache service. It reads only the engine's own HTTP endpoints (/v1/models, /health, /metrics).

lmcache describe engine --url http://localhost:8000

================ Inference Engine ================
Model:                  meta-llama/Llama-3.1-8B-Instruct
Max context (tokens):   131072
Status:                 OK
Running requests:       3
==================================================

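The output shows:

Model and Max context: the served model ID and its maximum context length, taken from /v1/models.
Status: OK / UNHEALTHY, taken from the engine's /health probe.
Running requests: in-flight requests, summed from the vllm:num_requests_running metric. Shows N/A if metrics are disabled or unreachable.

Only the /v1/models fetch is required: if /health or /metrics is unavailable, the command still reports what it can instead of failing.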
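The summing of vllm:num_requests_running can be sketched as follows. This is a minimal illustration that assumes /metrics returns the Prometheus text format; it is not the command's actual implementation. The function name and the sample payload are hypothetical.

```python
def sum_metric(metrics_text, name="vllm:num_requests_running"):
    """Sum every sample of `name` in Prometheus text; None if it never appears."""
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{label="..."} value [timestamp]
            metric, _, rest = line.partition("{")
            fields = rest.rpartition("}")[2].split()
        else:
            # Unlabelled sample: name value [timestamp]
            metric, *fields = line.split()
        if metric.strip() != name or not fields:
            continue
        total += float(fields[0])
        found = True
    return int(total) if found else None

# Illustrative payload, not captured from a real server.
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 1.0
"""

running = sum_metric(sample)
print(running if running is not None else "N/A")
```

Returning None when the metric is missing lets the caller print N/A rather than treating the gap as an error, which mirrors the fallback behavior described above.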

lmcache describe engine --url http://localhost:8000 --format json

{
  "title": "Inference Engine",
  "metrics": {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_context": 131072,
    "status": "OK",
    "running_requests": 3
  }
}

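Options#

Flag	Description
`target`	What to describe (positional, required): `kvcache` or `engine`.
`--url`	Server URL. The default depends on the target: `http://localhost:8080` for `kvcache` and `http://localhost:8000` for `engine`.
`--format`	Output format: `terminal` (default) or `json`.
`--output PATH`	Save the metrics to a file (the format follows `--format`).
`-q` / `--quiet`	Suppress standard output; only the exit code is returned.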

JSON output#

Use --format json for machine-readable output. Models, kernel groups, and L2 adapters are collected into lists for easy programmatic access:

lmcache describe kvcache --url http://localhost:8000 --format json

{
  "title": "LMCache KV Cache Service",
  "metrics": {
    "health": "OK",
    "url": "http://localhost:8000",
    "engine_type": "BlendEngine",
    "chunk_size": 256,
    "l1_capacity_gb": 60.0,
    "l1_used_gb": "42.30 (70.5%)",
    "eviction_policy": "LRU",
    "cached_objects": 1024,
    "active_sessions": 3,
    "models": [
      {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "world_size": 4,
        "gpu_ids": "0, 1, 2, 3",
        "num_layers": 80,
        "num_blocks": 2048,
        "cache_size_per_token": 327680
      }
    ],
    "kernel_groups": [
      {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "kernel_group_idx": 0,
        "engine_group_idx": 0,
        "object_group_idx": 0,
        "num_layers": 80,
        "slots_per_block": 128,
        "dtype": "torch.float16",
        "is_mla": false,
        "attention_backend": "vLLM non-MLA flash attention",
        "engine_kv_shape": "NL x [2, NB, BS, NH, HS]",
        "engine_kv_concrete_shape": "80 x [2, 2048, 128, 8, 128]"
      }
    ],
    "l2_adapters": [
      {
        "type": "NixlStoreL2Adapter",
        "health": "OK",
        "backend": "nixl_rdma",
        "stored_object_count": 512,
        "pool_used": "480 / 512 (93.8%)"
      }
    ]
  }
}
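A minimal sketch of consuming this JSON from Python. It assumes the document shape shown in the sample and that utilization fields such as l1_used_gb and pool_used remain strings ending in "(NN.N%)"; the helper names are hypothetical.

```python
import json
import re

def percent(text):
    """Return the figure inside a trailing '(NN.N%)', or None if absent."""
    match = re.search(r"\(([\d.]+)%\)", text)
    return float(match.group(1)) if match else None

def summarize(report_json):
    """Reduce a describe-kvcache JSON document to a few headline values."""
    metrics = json.loads(report_json)["metrics"]
    return {
        "healthy": metrics["health"] == "OK",
        "l1_used_pct": percent(metrics["l1_used_gb"]),
        "models": [m["model"] for m in metrics["models"]],
        "l2_pool_pct": {
            a["type"]: percent(a["pool_used"]) for a in metrics["l2_adapters"]
        },
    }

# Trimmed copy of the sample document.
sample = json.dumps({
    "title": "LMCache KV Cache Service",
    "metrics": {
        "health": "OK",
        "l1_used_gb": "42.30 (70.5%)",
        "models": [{"model": "meta-llama/Llama-3.1-70B-Instruct"}],
        "l2_adapters": [
            {"type": "NixlStoreL2Adapter", "pool_used": "480 / 512 (93.8%)"}
        ],
    },
})

print(summarize(sample))
```

The input could equally be read from a file written with `--output PATH`.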

Engine KV shape abbreviations#

The engine_kv_shape field uses short names from the EngineKVFormat enum:

Abbreviation	Meaning
NB	num_blocks
NL	num_layers
BS	block_size
NH	num_heads
HS	head_size
PBS	page_buffer_size (NB × BS)
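To make the abbreviations concrete, the sketch below pairs the symbolic engine_kv_shape string with engine_kv_concrete_shape from the sample output and derives PBS. It assumes both strings follow the "N x [a, b, ...]" layout used on this page; the function names are hypothetical.

```python
import re

def parse_shape(text):
    """Split 'NL x [2, NB, BS, NH, HS]' into ['NL', '2', 'NB', 'BS', 'NH', 'HS']."""
    outer, inner = re.fullmatch(r"\s*(\w+)\s*x\s*\[(.*)\]\s*", text).groups()
    return [outer] + [item.strip() for item in inner.split(",")]

def bind_dims(symbolic, concrete):
    """Map each abbreviation in the symbolic shape to its concrete size."""
    dims = {}
    for name, value in zip(parse_shape(symbolic), parse_shape(concrete)):
        if not name.isdigit():  # skip literal sizes such as the leading 2
            dims[name] = int(value)
    return dims

dims = bind_dims("NL x [2, NB, BS, NH, HS]", "80 x [2, 2048, 128, 8, 128]")
dims["PBS"] = dims["NB"] * dims["BS"]  # page buffer size, per the table
print(dims)
# {'NL': 80, 'NB': 2048, 'BS': 128, 'NH': 8, 'HS': 128, 'PBS': 262144}
```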
