lmcache bench engine

The lmcache bench engine command runs sustained performance benchmarks against an inference engine (e.g., vLLM). It supports multiple workload types that exercise different caching patterns and reports TTFT, decoding speed, and throughput metrics.

lmcache bench engine [options]

There are three ways to configure the benchmark:

  1. CLI arguments – pass all options on the command line.

  2. Interactive mode – run lmcache bench engine without the required arguments and follow the step-by-step prompts.

  3. Config file – save a configuration to JSON and replay it with --config.

Quick Start

Minimal (with all required arguments):

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080

Interactive mode (guided setup):

lmcache bench engine

The interactive mode walks you through each required setting, then asks whether you want to configure general and workload-specific options or use defaults.

From a saved config file:

lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

Config files contain benchmark parameters (workload, KV cache settings, etc.) but not the engine URL, so you can reuse the same config against different engines.

Export a config without running the benchmark:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --export-config my_bench.json

This resolves all auto-detected values (model name, tokens per GB) and saves them to a portable JSON file that works without an LMCache server.

Non-interactive mode (for scripts and CI):

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --no-interactive

With --no-interactive, the command exits with an error immediately if any required argument is missing, instead of entering interactive mode. This is useful in automated pipelines.

If you don’t have an LMCache server, you can pass --tokens-per-gb-kvcache directly instead of --lmcache-url (see Finding --tokens-per-gb-kvcache for how to find this value).
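
The required-argument rule described above (engine URL, workload, and either a tokens-per-GB value or an LMCache URL) can be sketched as a pre-flight check for scripts. This is a minimal illustration, not the tool's own validation code; the function name missing_required and its parameters are hypothetical.

```python
from typing import List, Optional


def missing_required(
    engine_url: Optional[str],
    workload: Optional[str],
    tokens_per_gb_kvcache: Optional[int],
    lmcache_url: Optional[str],
) -> List[str]:
    """Return the required flags that are still missing (hypothetical helper)."""
    missing = []
    if not engine_url:
        missing.append("--engine-url")
    if not workload:
        missing.append("--workload")
    # --tokens-per-gb-kvcache is required unless --lmcache-url is set.
    if tokens_per_gb_kvcache is None and not lmcache_url:
        missing.append("--tokens-per-gb-kvcache or --lmcache-url")
    return missing


# An LMCache URL stands in for an explicit tokens-per-GB value.
print(missing_required("http://localhost:8000", "long-doc-qa", None, "http://localhost:8080"))
# Neither source of the tokens-per-GB value is given.
print(missing_required("http://localhost:8000", "long-doc-qa", None, None))
```

A CI script could run a check like this before invoking the command with --no-interactive, so a misconfigured job fails with a clear message.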

General Options

| Flag | Required | Description |
| --- | --- | --- |
| --config FILE | No | Load configuration from a JSON file. Skips interactive mode. CLI flags override values in the file. The engine URL is not stored in config files and must be provided separately. |
| --export-config FILE | No | Export resolved configuration to a JSON file and exit. Does not run the benchmark. Auto-detected values (model, tokens per GB) are resolved and saved so the config is portable. Environment-specific values (engine URL, LMCache URL) are excluded. |
| --no-interactive | No | Disable interactive mode. Errors if required arguments are missing instead of prompting. Useful for scripts and CI. |
| --engine-url URL | Yes | Inference engine URL (e.g., http://localhost:8000). Set the OPENAI_API_KEY environment variable if authentication is needed. |
| --workload TYPE | Yes | Workload type: long-doc-qa, multi-round-chat, or random-prefill. |
| --tokens-per-gb-kvcache N | Conditional | Tokens per GB of KV cache. Required unless --lmcache-url is set. See Finding --tokens-per-gb-kvcache for how to find this value. |
| --lmcache-url URL | No | LMCache HTTP server URL. When provided, --tokens-per-gb-kvcache is auto-detected from the server. |
| --model NAME | No | Model name. Auto-detected from the engine if omitted. |
| --kv-cache-volume GB | No | Target active KV cache volume in GB (default: 100). |
| --seed N | No | Random seed (default: 42). |
| --output-dir DIR | No | Directory for CSV and JSON output files (default: current directory). |
| --no-csv | No | Skip CSV export. |
| --json | No | Export a JSON summary file. |
| -q / --quiet | No | Suppress the real-time progress display. |

Finding --tokens-per-gb-kvcache

If you have an LMCache server running, the easiest approach is to pass --lmcache-url and let the tool auto-detect the value.

If you are using vLLM without LMCache, look for these lines in vLLM’s startup log:

INFO: Available KV cache memory: 12.34 GiB
INFO: GPU KV cache size: 567,890 tokens

Then compute:

tokens_per_gb = 567,890 / 12.34 ≈ 46,020
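
The same calculation can be scripted. The sketch below parses the two log lines exactly as they are shown above and divides the token count by the memory figure; the function name is hypothetical, and the regular expressions assume the log wording matches this page, which may differ between vLLM versions.

```python
import re


def tokens_per_gb_from_log(log_text: str) -> float:
    """Derive tokens per GB from the two vLLM startup log lines (hypothetical helper)."""
    mem = re.search(r"Available KV cache memory:\s*([\d.]+)\s*GiB", log_text)
    tok = re.search(r"GPU KV cache size:\s*([\d,]+)\s*tokens", log_text)
    if mem is None or tok is None:
        raise ValueError("expected KV cache lines not found in log text")
    mem_gib = float(mem.group(1))
    num_tokens = int(tok.group(1).replace(",", ""))  # drop thousands separators
    return num_tokens / mem_gib


sample_log = """\
INFO: Available KV cache memory: 12.34 GiB
INFO: GPU KV cache size: 567,890 tokens
"""
print(round(tokens_per_gb_from_log(sample_log)))  # 46020
```

The rounded result is the value to pass to --tokens-per-gb-kvcache.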

Workloads

long-doc-qa

Simulates repeated Q&A over long documents. Warmup sends each document once to populate the KV cache, then benchmark queries are dispatched with semaphore-controlled concurrency.

| Flag | Default | Description |
| --- | --- | --- |
| --ldqa-document-length | 10000 | Token length of each synthetic document. |
| --ldqa-query-per-document | 2 | Number of questions asked per document. |
| --ldqa-shuffle-policy | random | Request ordering: random (shuffled) or tile (round-by-round). |
| --ldqa-num-inflight-requests | 3 | Maximum concurrent in-flight requests. |

Example:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload long-doc-qa \
    --lmcache-url http://localhost:8080 \
    --kv-cache-volume 50 \
    --ldqa-document-length 8000 \
    --ldqa-query-per-document 4 \
    --ldqa-shuffle-policy tile
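
To make the ordering and concurrency options concrete, here is a minimal sketch of one plausible reading of them. It assumes that tile means every document receives its first query before any document receives its second, and it uses asyncio.Semaphore to cap in-flight requests. The function names and the sleep that stands in for an engine call are illustrative only, not the tool's implementation.

```python
import asyncio
import random
from typing import List, Tuple


def build_requests(num_documents: int, queries_per_document: int,
                   policy: str, seed: int = 42) -> List[Tuple[int, int]]:
    """Return (document_id, query_id) pairs in dispatch order."""
    # Assumed "tile" order: round 0 over all documents, then round 1, and so on.
    requests = [(d, q) for q in range(queries_per_document)
                for d in range(num_documents)]
    if policy == "random":
        random.Random(seed).shuffle(requests)
    elif policy != "tile":
        raise ValueError(f"unknown shuffle policy: {policy}")
    return requests


async def dispatch(requests: List[Tuple[int, int]], max_inflight: int) -> int:
    """Run all requests with at most max_inflight active; return the observed peak."""
    semaphore = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def send(_request: Tuple[int, int]) -> None:
        nonlocal inflight, peak
        async with semaphore:
            inflight += 1
            peak = max(peak, inflight)
            await asyncio.sleep(0.01)  # stand-in for the real engine request
            inflight -= 1

    await asyncio.gather(*(send(r) for r in requests))
    return peak


order = build_requests(num_documents=3, queries_per_document=2, policy="tile")
print(order)  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
peak = asyncio.run(dispatch(order, max_inflight=3))
print(peak <= 3)  # True
```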

multi-round-chat

Simulates multi-round chat with stateful sessions. Creates concurrent user sessions, dispatches requests at a fixed QPS rate, and records responses in session history so each subsequent query includes prior context.

| Flag | Default | Description |
| --- | --- | --- |
| --mrc-shared-prompt-length | 2000 | System prompt token length per session. |
| --mrc-chat-history-length | 10000 | Pre-filled chat history token length. |
| --mrc-user-input-length | 50 | Tokens per user query. |
| --mrc-output-length | 200 | Max tokens to generate per response. |
| --mrc-qps | 1.0 | Target queries per second. |
| --mrc-duration | 60.0 | Benchmark duration in seconds. |

Example:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload multi-round-chat \
    --lmcache-url http://localhost:8080 \
    --mrc-qps 2.0 \
    --mrc-duration 120
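
As a rough guide to what these settings imply, the sketch below computes a fixed-rate send schedule and an upper bound on prompt size per round. Both helpers are hypothetical. The context estimate assumes the prompt is the system prompt plus the pre-filled history plus every earlier round at full output length; the tool's actual accounting may differ.

```python
from typing import List


def dispatch_schedule(qps: float, duration_s: float) -> List[float]:
    """Send times, in seconds from start, for a fixed-rate dispatcher."""
    interval = 1.0 / qps
    count = int(duration_s * qps)
    return [i * interval for i in range(count)]


def max_context_tokens(round_index: int, shared_prompt: int = 2000,
                       history: int = 10000, user_input: int = 50,
                       output: int = 200) -> int:
    """Upper bound on prompt tokens for a session's Nth query (0-based, assumed model)."""
    prior_rounds = round_index * (user_input + output)
    return shared_prompt + history + prior_rounds + user_input


times = dispatch_schedule(qps=2.0, duration_s=120)
print(len(times), times[:3])  # 240 [0.0, 0.5, 1.0]
print(max_context_tokens(0), max_context_tokens(3))  # 12050 12800
```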

random-prefill

Fires all requests simultaneously with max_tokens=1 to measure pure prefill performance. No warmup phase.

| Flag | Default | Description |
| --- | --- | --- |
| --rp-request-length | 10000 | Token length per prefill request. |
| --rp-num-requests | 50 | Number of requests to fire. |

Example:

lmcache bench engine \
    --engine-url http://localhost:8000 \
    --workload random-prefill \
    --lmcache-url http://localhost:8080 \
    --rp-request-length 15000 \
    --rp-num-requests 100
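
The prefill-only pattern can be sketched as follows. The request body uses the OpenAI-style completions shape with max_tokens set to 1, which is an assumption about the endpoint, and fake_send is a placeholder so the example runs without a server. Neither is taken from the tool's source.

```python
import asyncio
from typing import Dict, List


def prefill_payload(model: str, prompt: str) -> Dict[str, object]:
    """Build a request body that generates a single token (assumed body shape)."""
    return {"model": model, "prompt": prompt, "max_tokens": 1}


async def fake_send(payload: Dict[str, object]) -> object:
    """Placeholder for an HTTP call; returns immediately."""
    await asyncio.sleep(0)
    return payload["max_tokens"]


async def fire_all(payloads: List[Dict[str, object]]) -> List[object]:
    # All requests are started together rather than paced or capped.
    return await asyncio.gather(*(fake_send(p) for p in payloads))


payloads = [prefill_payload("Qwen/Qwen3-14B", f"request {i}") for i in range(100)]
results = asyncio.run(fire_all(payloads))
total_prompt_tokens = 15000 * len(payloads)  # --rp-request-length x --rp-num-requests
print(len(results), total_prompt_tokens)  # 100 1500000
```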

Interactive Mode

[Image: interactive mode demo]

When --engine-url or --workload is not provided (and --no-interactive is not set), the tool enters interactive mode. It guides you through four phases:

  1. Required settings – engine URL, workload type, LMCache server (or tokens per GB).

  2. General settings (optional gate) – model name, KV cache volume.

  3. Workload settings (optional gate) – workload-specific parameters.

  4. Summary and action – review configuration, then start the benchmark or export to a JSON file.

Each prompt focuses on a single setting. Selection prompts use arrow keys; text and number prompts accept typed input with defaults shown in brackets.

══════════════════════════════════════════════════
 lmcache bench engine -- Interactive Setup
══════════════════════════════════════════════════

Engine URL
  URL of the inference engine.
  [default: http://localhost:8000] >

Workload
  The type of benchmark workload to run.
  Use up/down to navigate, Enter to select.

  * long-doc-qa       Repeated Q&A over long documents
    multi-round-chat   Multi-turn chat with stateful sessions
    random-prefill     Prefill-only requests fired simultaneously

LMCache Server
  Do you have a running LMCache server?
  It can auto-detect KV cache size information.
  [default: Y] (Y/n) >

...

──────────────────────────────────────────────────
 Configuration Summary
──────────────────────────────────────────────────
  Workload:             long-doc-qa
  Model:                Qwen/Qwen3-14B
  Tokens per GB:        6553
  ...
──────────────────────────────────────────────────

What would you like to do?
  * Start benchmark
    Export configuration for later use and exit

When you choose “Export configuration”, all auto-detected values (model name, tokens per GB) are resolved and saved to a portable JSON file.

Config File

Config files store benchmark parameters but not environment-specific values like engine URL or LMCache URL. This lets you reuse the same config across different environments.

You can create a config file in three ways:

  1. Interactive mode – choose “Export configuration” at the summary step.

  2. --export-config – resolve and export from the CLI without running the benchmark.

  3. Manually – write JSON with keys matching CLI arg names (dashes replaced by underscores).

Example config file:

{
  "model": "Qwen/Qwen3-14B",
  "workload": "long-doc-qa",
  "tokens_per_gb_kvcache": 6553,
  "kv_cache_volume": 100.0,
  "ldqa_document_length": 10000,
  "ldqa_query_per_document": 2,
  "ldqa_shuffle_policy": "random",
  "ldqa_num_inflight_requests": 3
}
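
When writing a config by hand, the flag-to-key rule above (dashes replaced by underscores) can be checked with a small helper. The function names here are hypothetical; the config values are the ones from the example above.

```python
import json


def flag_to_key(flag: str) -> str:
    """Turn a CLI flag such as --ldqa-document-length into its config key."""
    return flag.lstrip("-").replace("-", "_")


def key_to_flag(key: str) -> str:
    """Inverse mapping, from config key back to the CLI flag."""
    return "--" + key.replace("_", "-")


config = {
    flag_to_key("--model"): "Qwen/Qwen3-14B",
    flag_to_key("--workload"): "long-doc-qa",
    flag_to_key("--tokens-per-gb-kvcache"): 6553,
    flag_to_key("--kv-cache-volume"): 100.0,
    flag_to_key("--ldqa-document-length"): 10000,
}
print(json.dumps(config, indent=2))
print(key_to_flag("ldqa_shuffle_policy"))  # --ldqa-shuffle-policy
```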

Load it with --config (engine URL must be provided separately):

lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json

CLI arguments override config file values, so you can use a base config and tweak individual settings:

# Use saved config but override KV cache volume
lmcache bench engine --engine-url http://localhost:8000 \
    --config my_bench.json --kv-cache-volume 200
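
The precedence rule amounts to a dictionary merge in which explicitly passed CLI values win. The sketch below illustrates that rule only; it is not the tool's code, and it assumes that unset CLI options are represented as None.

```python
from typing import Any, Dict


def resolve(config_file_values: Dict[str, Any],
            cli_values: Dict[str, Any]) -> Dict[str, Any]:
    """Merge config-file values with CLI values; CLI values take precedence."""
    merged = dict(config_file_values)
    # Only options the user actually passed (not None) override the file.
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return merged


from_file = {"workload": "long-doc-qa", "kv_cache_volume": 100.0}
from_cli = {"workload": None, "kv_cache_volume": 200}
print(resolve(from_file, from_cli))
# {'workload': 'long-doc-qa', 'kv_cache_volume': 200}
```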

Output

Terminal (real-time progress)

During the benchmark, a live progress display shows in-flight requests, average TTFT, decode speed, and throughput. Suppress it with -q.

Terminal (final summary)

After completion, a summary table is printed:

======= Engine Benchmark Result (long-doc-qa) ========
---------------------- Configuration ------------------
Engine URL:                       http://localhost:8000
Model:                            Qwen/Qwen3-14B
Workload:                         long-doc-qa
------------------------- Results ---------------------
Successful requests:              20
Failed requests:                  0
Benchmark duration (s):           31.34
Total input tokens:               200000
Total output tokens:              2560
Input throughput (tok/s):         6381.62
Output throughput (tok/s):        81.69
--------------- Time to First Token -------------------
Mean TTFT (ms):                   313.41
P50 TTFT (ms):                    272.83
P90 TTFT (ms):                    587.21
P99 TTFT (ms):                    837.32
------------------ Decoding Speed ---------------------
Mean decode (tok/s):              48.23
P50 decode (tok/s):               47.91
P90 decode (tok/s):               42.10
P99 decode (tok/s):               38.55
======================================================
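
The throughput rows can be related to the token totals and the duration. The sketch below assumes throughput is total tokens divided by benchmark duration, which is consistent with the sample above but is not confirmed as the tool's exact formula. Because the displayed duration is rounded, recomputed figures may differ from the table in the last digit.

```python
def throughput(total_tokens: int, duration_s: float) -> float:
    """Tokens per second over the whole benchmark run (assumed definition)."""
    return total_tokens / duration_s


input_tps = throughput(200000, 31.34)
output_tps = throughput(2560, 31.34)
print(f"{input_tps:.2f}")  # 6381.62
# Close to the 81.69 tok/s shown in the sample output.
print(abs(output_tps - 81.69) < 0.01)  # True
```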

CSV and JSON

  • bench_results.csv – per-request metrics (TTFT, latency, decode speed, token counts). Written by default; skip with --no-csv.

  • bench_summary.json – aggregate statistics with percentiles and config metadata. Opt-in with --json.

Both files are written to --output-dir (default: current directory).

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All requests succeeded. |
| 1 | One or more requests failed. |
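
In a pipeline, the exit code can be checked from a wrapper script. The sketch below builds the command from flags documented on this page and interprets the two documented codes. The helper names are hypothetical, and run_benchmark is defined but not called here because it needs a running engine.

```python
import subprocess
from typing import List


def bench_command(engine_url: str, workload: str, lmcache_url: str) -> List[str]:
    """Assemble a non-interactive benchmark command line."""
    return [
        "lmcache", "bench", "engine",
        "--engine-url", engine_url,
        "--workload", workload,
        "--lmcache-url", lmcache_url,
        "--no-interactive",
    ]


def describe_exit(code: int) -> str:
    """Map an exit code to the meanings listed in the table above."""
    if code == 0:
        return "all requests succeeded"
    if code == 1:
        return "one or more requests failed"
    return f"undocumented exit code {code}"


def run_benchmark(cmd: List[str]) -> int:
    """Run the benchmark and return its exit code (requires lmcache and an engine)."""
    return subprocess.run(cmd).returncode


cmd = bench_command("http://localhost:8000", "long-doc-qa", "http://localhost:8080")
print(" ".join(cmd))
print(describe_exit(0))  # all requests succeeded
print(describe_exit(1))  # one or more requests failed
```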