CLI Reference#

LMCache provides a unified lmcache command-line interface for interacting with KV cache servers, running benchmarks, and inspecting cache state.

lmcache <command> [options]

Installation#

The lmcache CLI is available in two packages:

Package	Install	When to use
`lmcache`	`pip install lmcache`	Full install: server, CLI, and CUDA extensions. Requires Linux + GPU.
`lmcache-cli`	`pip install lmcache-cli`	CLI only: query, ping, bench, describe. No GPU required, any OS.

Note

Do not install both packages in the same environment — they both provide the lmcache entry point.

Quick Start#

After installing LMCache, the lmcache command is available:

# Show available commands
lmcache -h

# Check if the KV cache server is alive
lmcache ping kvcache

# Launch the LMCache server (ZMQ + HTTP)
lmcache server --host 0.0.0.0 --port 5555 --l1-size-gb 100 --eviction-policy LRU

# Run a benchmark against the engine
lmcache bench engine --engine-url http://localhost:8000 \
    --workload long-doc-qa --lmcache-url http://localhost:8080

# JSON on stdout (for scripts)
lmcache ping kvcache --format json

# Save metrics to a file (format follows --format, default: terminal)
lmcache describe kvcache --format json --output status.json

Available Commands#

Command	Description
`describe`	Show detailed status of a running LMCache service, including cache health, L1 storage, registered models, and L2 adapters.
`query`	Single-shot query interface for both the serving engine and KV cache worker.
`server`	Launch the LMCache server (ZMQ + HTTP). Requires full `lmcache` install.
`ping`	Liveness check for LMCache or vLLM servers.
`bench`	Run sustained performance benchmarks against an inference engine.
`kvcache`	Manage KV cache state (e.g. clear L1 cache) on a running server.

`bench` — Engine Benchmarking#

Run sustained benchmarks against an inference engine with multiple workload types:

# Minimal: all required args on the command line
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080

# Interactive mode: guided step-by-step setup
lmcache bench engine

# From a saved config file (engine URL provided separately)
lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

# Export config for later reuse (resolves auto-detected values)
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --export-config my_bench.json

# Non-interactive mode for scripts/CI (errors if args missing)
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --no-interactive

Five workloads are available:

long-doc-qa – repeated Q&A over long documents (tests KV cache reuse).
multi-round-chat – multi-turn chat with stateful sessions.
long-doc-permutator – permutations of context documents (blended cache reuse stress test).
prefix-suffix-tuner – two-pass sequential workload demonstrating tiered KV-cache reuse (L0/L1/L2).
random-prefill – prefill-only requests fired simultaneously.

See lmcache bench engine for full documentation including all workload options, interactive mode details, and config file format.

`describe` — Service Status Dashboard#

Inspect the state of a running LMCache KV cache server:

lmcache describe kvcache --url http://localhost:8000

============ LMCache KV Cache Service ============
Health:                                         OK
URL:                         http://localhost:8000
Engine type:                           BlendEngine
Chunk size:                                    256
L1 capacity (GB):                            60.00
L1 used (GB):                        42.30 (70.5%)
Eviction policy:                               LRU
Cached objects:                               1024
Active sessions:                                 3
---- Model: meta-llama/Llama-3.1-70B-Instruct ----
Model:           meta-llama/Llama-3.1-70B-Instruct
World size:                                      4
GPU IDs:                                0, 1, 2, 3
Attention backend:    vLLM non-MLA flash attention
GPU KV shape:             NL x [2, NB, BS, NH, HS]
GPU KV tensor shape:   80 x [2, 2048, 128, 8, 128]
Num layers:                                     80
Block size:                                    128
Hidden dim size:                              1024
Dtype:                               torch.float16
MLA:                                         False
Num blocks:                                   2048
------------- L2: NixlStoreL2Adapter -------------
Type:                           NixlStoreL2Adapter
Health:                                         OK
Backend:                                 nixl_rdma
Stored objects:                                512
Pool used:                       480 / 512 (93.8%)
==================================================

The output shows:

Overview — health status, engine type, chunk size.
L1 storage — capacity, usage, eviction policy, cached object count.
Registered models — per-model KV cache layout including the GPU KV tensor shape (symbolic and concrete), attention backend, and layer details.
L2 adapters — type, health, backend, stored objects, and utilization.

Arguments#

Flag	Description
`kvcache`	Target to describe (currently only `kvcache` is supported).
`--url`	LMCache HTTP server URL (default: `http://localhost:8080`).
`--format`	Output format: `terminal` (default) or `json`.
`--output PATH`	Save metrics to a file (format follows `--format`).

JSON Output#

Use --format json for machine-readable output. Models and L2 adapters are collected into lists for easy programmatic access:

lmcache describe kvcache --url http://localhost:8000 --format json

{
  "title": "LMCache KV Cache Service",
  "metrics": {
    "health": "OK",
    "url": "http://localhost:8000",
    "engine_type": "BlendEngine",
    "chunk_size": 256,
    "l1_capacity_gb": 60.0,
    "l1_used_gb": "42.30 (70.5%)",
    "eviction_policy": "LRU",
    "cached_objects": 1024,
    "active_sessions": 3,
    "models": [
      {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "world_size": 4,
        "gpu_ids": "0, 1, 2, 3",
        "attention_backend": "vLLM non-MLA flash attention",
        "gpu_kv_shape": "NL x [2, NB, BS, NH, HS]",
        "gpu_kv_concrete_shape": "80 x [2, 2048, 128, 8, 128]",
        "num_layers": 80,
        "block_size": 128,
        "hidden_dim_size": 1024,
        "dtype": "torch.float16",
        "is_mla": false,
        "num_blocks": 2048
      }
    ],
    "l2_adapters": [
      {
        "type": "NixlStoreL2Adapter",
        "health": "OK",
        "backend": "nixl_rdma",
        "stored_object_count": 512,
        "pool_used": "480 / 512 (93.8%)"
      }
    ]
  }
}

GPU KV Shape Abbreviations#

The gpu_kv_shape field uses short names from the GPUKVFormat enum:

Abbrev	Meaning
NB	num_blocks
NL	num_layers
BS	block_size
NH	num_heads
HS	head_size
PBS	page_buffer_size (NB × BS)

`query`#

The query engine subcommand sends one request to the engine API and reports metrics. --prompt supports placeholders: {lmcache} loads lmcache/cli/documents/lmcache.txt, and custom documents can be passed with --documents NAME=PATH. The prompt token count is taken directly from the usage data reported by the engine (stream_options: {include_usage: true}).

 lmcache query engine --url http://localhost:8000/v1 \
   --prompt "{lmcache} Summarize LMCache usage." \
   --format terminal \
   --max-tokens 128

================= Query Engine =================
Model:                         facebook/opt-125m
Input tokens:                                618
--------------- Latency Metrics ----------------
Output tokens:                                 9
TTFT (ms):                                 26.88
TPOT (ms/token):                            0.91
Total latency (ms):                        35.05
Throughput (tokens/s):                   1100.64
================================================

`ping` — Liveness Check#

Check whether an LMCache KV cache server or a vLLM serving engine is reachable:

# Ping the KV cache server (default: http://localhost:8080)
lmcache ping kvcache

# Ping the serving engine (default: http://localhost:8000)
lmcache ping engine --url http://localhost:8000

======= Ping KV Cache ========
Status:                   OK
Round trip time (ms):     3.42
==============================

Options#

Flag	Description
`kvcache` \| `engine`	Target to ping (positional, required).
`--url`	Server URL. Defaults to `http://localhost:8080` for `kvcache`, `http://localhost:8000` for `engine`.
`--format`	Output format: `terminal` (default) or `json`.
`--output PATH`	Save metrics to a file (format follows `--format`).
`-q` / `--quiet`	Suppress stdout output. Exit code only.

Exit Codes#

Code	Meaning
`0`	Server is reachable (HTTP 200).
`1`	Connection failure or non-200 response.

`kvcache` — KV Cache Management#

Manage KV cache state on a running LMCache server. See lmcache kvcache for full documentation including examples, options, and common patterns.

Quick example:

# Clear all L1 (CPU) cache
lmcache kvcache clear --url http://localhost:8000

Metrics Output#

All commands that produce metrics support two output formats:

Terminal Output#

Human-readable ASCII table:

======= Ping KV Cache ========
Status:                   OK
Round trip time (ms):     3.42
==============================

JSON Output#

Machine-readable output with structured keys, available via --format json (stdout) or --output (file):

lmcache ping kvcache --format json

{
  "title": "Ping KV Cache",
  "metrics": {
    "status": "OK",
    "round_trip_time_ms": 3.42
  }
}

The terminal output uses human-readable labels (e.g., "Round trip time (ms)"), while the JSON output uses machine-readable keys (e.g., "round_trip_time_ms").

Adding New Commands#

New CLI subcommands can be added by creating a BaseCommand subclass and registering it. See Extending the CLI for details.

CLI Reference#

Installation#

Quick Start#

Available Commands#

bench — Engine Benchmarking#

describe — Service Status Dashboard#

Arguments#

JSON Output#

GPU KV Shape Abbreviations#

query#

ping — Liveness Check#

Options#

Exit Codes#

kvcache — KV Cache Management#

Metrics Output#

Terminal Output#

JSON Output#

Adding New Commands#

`bench` — Engine Benchmarking#

`describe` — Service Status Dashboard#

`query`#

`ping` — Liveness Check#

`kvcache` — KV Cache Management#