CLI Reference#

LMCache provides a unified lmcache command-line interface for interacting with KV cache servers, running benchmarks, and inspecting cache state.

lmcache <command> [options]

Installation#

The lmcache CLI is available in two packages:

| Package | Install | When to use |
| --- | --- | --- |
| lmcache | pip install lmcache | Full install: server, CLI, and CUDA extensions. Requires Linux + GPU. |
| lmcache-cli | pip install lmcache-cli | CLI only: query, ping, bench, describe. No GPU required, any OS. |

Note

Do not install both packages in the same environment — they both provide the lmcache entry point.

Quick Start#

After installing LMCache, the lmcache command is available:

# Show available commands
lmcache -h

# Check if the KV cache server is alive
lmcache ping kvcache

# Launch the LMCache server (ZMQ + HTTP)
lmcache server --host 0.0.0.0 --port 5555 --l1-size-gb 100 --eviction-policy LRU

# Run a benchmark against the engine
lmcache bench engine --engine-url http://localhost:8000 \
    --workload long-doc-qa --lmcache-url http://localhost:8080

# JSON on stdout (for scripts)
lmcache ping kvcache --format json

# Save metrics to a file (format follows --format, default: terminal)
lmcache describe kvcache --format json --output status.json

Available Commands#

| Command | Description |
| --- | --- |
| describe | Show detailed status of a running LMCache service, including cache health, L1 storage, registered models, and L2 adapters. |
| query | Single-shot query interface for both the serving engine and KV cache worker. |
| server | Launch the LMCache server (ZMQ + HTTP). Requires full lmcache install. |
| ping | Liveness check for LMCache or vLLM servers. |
| bench | Run sustained performance benchmarks against an inference engine. |
| kvcache | Manage KV cache state (e.g. clear L1 cache) on a running server. |

bench — Engine Benchmarking#

Run sustained benchmarks against an inference engine with multiple workload types:

# Minimal: all required args on the command line
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080

# Interactive mode: guided step-by-step setup
lmcache bench engine

# From a saved config file (engine URL provided separately)
lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

# Export config for later reuse (resolves auto-detected values)
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --export-config my_bench.json

# Non-interactive mode for scripts/CI (errors if args missing)
lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --no-interactive

Three workloads are available:

  • long-doc-qa – repeated Q&A over long documents (tests KV cache reuse).

  • multi-round-chat – multi-turn chat with stateful sessions.

  • random-prefill – prefill-only requests fired simultaneously.

See lmcache bench engine for full documentation including all workload options, interactive mode details, and config file format.
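
For scripted or CI runs, it can help to assemble the non-interactive command in one place and validate the workload name before launching. The following is a minimal sketch that uses only the flags and workload names listed on this page; the helper name build_bench_command is hypothetical and not part of LMCache.

```python
import shlex

# Workload names listed in this reference.
WORKLOADS = ("long-doc-qa", "multi-round-chat", "random-prefill")


def build_bench_command(engine_url, workload, lmcache_url):
    """Return the argv list for a non-interactive `lmcache bench engine` run."""
    if workload not in WORKLOADS:
        raise ValueError(f"unknown workload {workload!r}; expected one of {WORKLOADS}")
    return [
        "lmcache", "bench", "engine",
        "--engine-url", engine_url,
        "--workload", workload,
        "--lmcache-url", lmcache_url,
        "--no-interactive",  # error out instead of prompting when an argument is missing
    ]


cmd = build_bench_command("http://localhost:8000", "long-doc-qa", "http://localhost:8080")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute the benchmark.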

describe — Service Status Dashboard#

Inspect the state of a running LMCache KV cache server:

lmcache describe kvcache --url http://localhost:8000
============ LMCache KV Cache Service ============
Health:                                         OK
URL:                         http://localhost:8000
Engine type:                           BlendEngine
Chunk size:                                    256
L1 capacity (GB):                            60.00
L1 used (GB):                        42.30 (70.5%)
Eviction policy:                               LRU
Cached objects:                               1024
Active sessions:                                 3
---- Model: meta-llama/Llama-3.1-70B-Instruct ----
Model:           meta-llama/Llama-3.1-70B-Instruct
World size:                                      4
GPU IDs:                                0, 1, 2, 3
Attention backend:    vLLM non-MLA flash attention
GPU KV shape:             NL x [2, NB, BS, NH, HS]
GPU KV tensor shape:   80 x [2, 2048, 128, 8, 128]
Num layers:                                     80
Block size:                                    128
Hidden dim size:                              1024
Dtype:                               torch.float16
MLA:                                         False
Num blocks:                                   2048
------------- L2: NixlStoreL2Adapter -------------
Type:                           NixlStoreL2Adapter
Health:                                         OK
Backend:                                 nixl_rdma
Stored objects:                                512
Pool used:                       480 / 512 (93.8%)
==================================================

The output shows:

  • Overview — health status, engine type, chunk size.

  • L1 storage — capacity, usage, eviction policy, cached object count.

  • Registered models — per-model KV cache layout including the GPU KV tensor shape (symbolic and concrete), attention backend, and layer details.

  • L2 adapters — type, health, backend, stored objects, and utilization.

Arguments#

| Flag | Description |
| --- | --- |
| kvcache | Target to describe (currently only kvcache is supported). |
| --url | LMCache HTTP server URL (default: http://localhost:8080). |
| --format | Output format: terminal (default) or json. |
| --output PATH | Save metrics to a file (format follows --format). |

JSON Output#

Use --format json for machine-readable output. Models and L2 adapters are collected into lists for easy programmatic access:

lmcache describe kvcache --url http://localhost:8000 --format json
{
  "title": "LMCache KV Cache Service",
  "metrics": {
    "health": "OK",
    "url": "http://localhost:8000",
    "engine_type": "BlendEngine",
    "chunk_size": 256,
    "l1_capacity_gb": 60.0,
    "l1_used_gb": "42.30 (70.5%)",
    "eviction_policy": "LRU",
    "cached_objects": 1024,
    "active_sessions": 3,
    "models": [
      {
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "world_size": 4,
        "gpu_ids": "0, 1, 2, 3",
        "attention_backend": "vLLM non-MLA flash attention",
        "gpu_kv_shape": "NL x [2, NB, BS, NH, HS]",
        "gpu_kv_concrete_shape": "80 x [2, 2048, 128, 8, 128]",
        "num_layers": 80,
        "block_size": 128,
        "hidden_dim_size": 1024,
        "dtype": "torch.float16",
        "is_mla": false,
        "num_blocks": 2048
      }
    ],
    "l2_adapters": [
      {
        "type": "NixlStoreL2Adapter",
        "health": "OK",
        "backend": "nixl_rdma",
        "stored_object_count": 512,
        "pool_used": "480 / 512 (93.8%)"
      }
    ]
  }
}
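
As an illustration of programmatic access, the sketch below parses a trimmed copy of the report above and pulls out a few fields. Note that l1_used_gb is a display string rather than a number, so the leading value has to be extracted. The function name summarize and the choice of fields are illustrative assumptions; only the JSON keys come from the sample output.

```python
import json
import re

# Sample trimmed from the `lmcache describe kvcache --format json` output above.
SAMPLE = """
{
  "title": "LMCache KV Cache Service",
  "metrics": {
    "health": "OK",
    "l1_capacity_gb": 60.0,
    "l1_used_gb": "42.30 (70.5%)",
    "models": [{"model": "meta-llama/Llama-3.1-70B-Instruct", "num_layers": 80}],
    "l2_adapters": [{"type": "NixlStoreL2Adapter", "health": "OK"}]
  }
}
"""


def summarize(report_text):
    """Extract a few fields from a describe report given as JSON text."""
    metrics = json.loads(report_text)["metrics"]
    # l1_used_gb looks like "42.30 (70.5%)"; keep only the leading number.
    match = re.match(r"\s*([0-9.]+)", metrics["l1_used_gb"])
    used_gb = float(match.group(1)) if match else None
    return {
        "healthy": metrics["health"] == "OK",
        "l1_used_gb": used_gb,
        "l1_free_gb": None if used_gb is None else metrics["l1_capacity_gb"] - used_gb,
        "models": [m["model"] for m in metrics.get("models", [])],
        "unhealthy_l2": [a["type"] for a in metrics.get("l2_adapters", [])
                         if a["health"] != "OK"],
    }


print(summarize(SAMPLE))
```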

GPU KV Shape Abbreviations#

The gpu_kv_shape field uses short names from the GPUKVFormat enum:

| Abbrev | Meaning |
| --- | --- |
| NB | num_blocks |
| NL | num_layers |
| BS | block_size |
| NH | num_heads |
| HS | head_size |
| PBS | page_buffer_size (NB × BS) |
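
To show how the symbolic and concrete shapes relate, the sketch below substitutes the example model's values into NL x [2, NB, BS, NH, HS] and estimates the size of the tensor that shape describes. The values come from the describe output on this page; the helper expand is hypothetical, and the size estimate assumes 2 bytes per element for torch.float16.

```python
import math

# Values taken from the example model in the describe output on this page.
DIMS = {"NL": 80, "NB": 2048, "BS": 128, "NH": 8, "HS": 128}
SYMBOLIC = "NL x [2, NB, BS, NH, HS]"
BYTES_PER_ELEMENT = 2  # torch.float16


def expand(symbolic, dims):
    """Turn 'NL x [2, NB, ...]' into (layer_count, per_layer_shape)."""
    layers, _, shape = symbolic.partition(" x ")
    tokens = [t.strip() for t in shape.strip("[]").split(",")]
    # Literal integers (the leading 2) pass through; names are looked up.
    resolved = [int(t) if t.isdigit() else dims[t] for t in tokens]
    return dims[layers], resolved


num_layers, per_layer = expand(SYMBOLIC, DIMS)
elements = num_layers * math.prod(per_layer)
total_gib = elements * BYTES_PER_ELEMENT / 2**30

print(f"{num_layers} x {per_layer}")                   # 80 x [2, 2048, 128, 8, 128]
print(f"PBS = NB x BS = {DIMS['NB'] * DIMS['BS']}")    # 262144
print(f"NH x HS = {DIMS['NH'] * DIMS['HS']}")          # 1024, the reported hidden dim size
print(f"KV tensor size: {total_gib:.1f} GiB")          # 80.0 GiB
```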

query#

The query engine subcommand sends a single request to the engine API and reports latency metrics. The --prompt string supports placeholders: {lmcache} loads the bundled document lmcache/cli/documents/lmcache.txt, and custom documents can be passed with --documents NAME=PATH. The input token count is taken directly from the usage data reported by the engine (stream_options: {include_usage: true}).

lmcache query engine --url http://localhost:8000/v1 \
    --prompt "{lmcache} Summarize LMCache usage." \
    --format terminal \
    --max-tokens 128

================= Query Engine =================
Model:                         facebook/opt-125m
Input tokens:                                618
--------------- Latency Metrics ----------------
Output tokens:                                 9
TTFT (ms):                                 26.88
TPOT (ms/token):                            0.91
Total latency (ms):                        35.05
Throughput (tokens/s):                   1100.64
================================================
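
The latency fields in the sample report are numerically consistent with a simple relationship: the time after the first token (total latency minus TTFT) divided by the output token count gives the TPOT, and the inverse gives the throughput. The sketch below checks this against the sample numbers. This relationship is inferred from the example only and is an assumption, not a statement of how the CLI computes its metrics; decode_metrics is a hypothetical helper.

```python
def decode_metrics(output_tokens, ttft_ms, total_ms):
    """Derive per-token latency and decode throughput from the timing fields."""
    decode_ms = total_ms - ttft_ms        # time spent after the first token
    tpot_ms = decode_ms / output_tokens   # assumed definition, inferred from the sample
    throughput = output_tokens / (decode_ms / 1000.0)  # tokens per second
    return tpot_ms, throughput


# Numbers copied from the sample report above.
tpot, tput = decode_metrics(output_tokens=9, ttft_ms=26.88, total_ms=35.05)
print(f"TPOT ~ {tpot:.2f} ms/token, throughput ~ {tput:.0f} tokens/s")
```

Because the report rounds its values to two decimals, the recomputed throughput agrees with the reported figure only approximately.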

ping — Liveness Check#

Check whether an LMCache KV cache server or a vLLM serving engine is reachable:

# Ping the KV cache server (default: http://localhost:8080)
lmcache ping kvcache

# Ping the serving engine (default: http://localhost:8000)
lmcache ping engine --url http://localhost:8000

Example output for lmcache ping kvcache:

======= Ping KV Cache ========
Status:                     OK
Round trip time (ms):     3.42
==============================

Options#

| Flag | Description |
| --- | --- |
| kvcache or engine | Target to ping (positional, required). |
| --url | Server URL. Defaults to http://localhost:8080 for kvcache, http://localhost:8000 for engine. |
| --format | Output format: terminal (default) or json. |
| --output PATH | Save metrics to a file (format follows --format). |
| -q / --quiet | Suppress stdout output. Exit code only. |

Exit Codes#

| Code | Meaning |
| --- | --- |
| 0 | Server is reachable (HTTP 200). |
| 1 | Connection failure or non-200 response. |
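
Because ping reports reachability through its exit code, a startup script can poll it until the server responds. The following is a minimal sketch of such a wait loop; the helper name wait_until_alive, the retry count, and the delay are illustrative choices and not part of LMCache.

```python
import subprocess
import time


def wait_until_alive(command, attempts=5, delay_s=1.0):
    """Run `command` repeatedly until it exits with code 0.

    Returns True as soon as one run succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        result = subprocess.run(command)
        if result.returncode == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # back off before the next probe
    return False
```

A typical call would be wait_until_alive(["lmcache", "ping", "kvcache", "-q"]), where -q suppresses stdout so that only the exit code carries the result.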

kvcache — KV Cache Management#

Manage KV cache state on a running LMCache server. See lmcache kvcache for full documentation including examples, options, and common patterns.

Quick example:

# Clear all L1 (CPU) cache
lmcache kvcache clear --url http://localhost:8000

Metrics Output#

All commands that produce metrics support two output formats:

Terminal Output#

Human-readable ASCII table:

======= Ping KV Cache ========
Status:                     OK
Round trip time (ms):     3.42
==============================

JSON Output#

Machine-readable output with structured keys, selected with --format json. Add --output PATH to save it to a file:

lmcache ping kvcache --format json
{
  "title": "Ping KV Cache",
  "metrics": {
    "status": "OK",
    "round_trip_time_ms": 3.42
  }
}

The terminal output uses human-readable labels (e.g., "Round trip time (ms)"), while the JSON output uses machine-readable keys (e.g., "round_trip_time_ms").
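
The relationship between the two formats can be sketched as a small renderer that maps JSON keys to display labels and lays them out in a fixed-width table. This is only an approximation of the layout shown above: the LABELS mapping and the render function are hypothetical, and the CLI's own implementation may differ.

```python
# Hypothetical key-to-label mapping for the ping report; the real CLI keeps its own.
LABELS = {"status": "Status", "round_trip_time_ms": "Round trip time (ms)"}


def render(report, width=30):
    """Format a {"title", "metrics"} report as a fixed-width text table."""
    lines = [f" {report['title']} ".center(width, "=")]
    for key, value in report["metrics"].items():
        label = LABELS.get(key, key) + ":"
        # Label on the left, value right-aligned to the table width.
        lines.append(label + str(value).rjust(width - len(label)))
    lines.append("=" * width)
    return "\n".join(lines)


report = {"title": "Ping KV Cache",
          "metrics": {"status": "OK", "round_trip_time_ms": 3.42}}
print(render(report))
```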

Adding New Commands#

New CLI subcommands can be added by creating a BaseCommand subclass and registering it. See Extending the CLI for details.