lmcache query
The lmcache query command sends a single OpenAI-compatible inference
request and reports token and latency metrics. It has two targets:
lmcache query {engine,kvcache} [options]
engine: send one request to a serving engine’s HTTP API.
kvcache: query KV-cache endpoints (not implemented yet).
query engine
The query engine subcommand sends one request to the engine API and
reports metrics. --prompt supports placeholders: {lmcache} loads
lmcache/cli/documents/lmcache.txt, and custom documents can be passed with
--documents NAME=PATH. The prompt token count is taken directly from the
usage data reported by the engine (stream_options: {include_usage: true}).
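The placeholder mechanism described above can be pictured with a small Python sketch. It is only an illustration of the idea, not LMCache's actual code: the helper names parse_documents and expand_prompt are hypothetical, and the sketch assumes a placeholder is a word wrapped in braces and that an unknown placeholder is an error. The built-in {lmcache} document would simply be one more entry in the mapping.

```python
import re
from pathlib import Path


def parse_documents(specs):
    """Turn ["NAME=PATH", ...] into {"NAME": Path(PATH)} (hypothetical helper)."""
    docs = {}
    for spec in specs:
        name, sep, path = spec.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME=PATH, got {spec!r}")
        docs[name] = Path(path)
    return docs


def expand_prompt(prompt, docs):
    """Replace each {name} in the prompt with the text of the matching file."""
    def substitute(match):
        name = match.group(1)
        if name not in docs:
            raise KeyError(f"no document registered for placeholder {{{name}}}")
        return docs[name].read_text(encoding="utf-8")

    # Assumption: placeholder names consist of word characters only.
    return re.sub(r"\{(\w+)\}", substitute, prompt)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        notes = Path(tmp) / "notes.txt"
        notes.write_text("LMCache notes.", encoding="utf-8")
        docs = parse_documents([f"notes={notes}"])
        print(expand_prompt("{notes} Summarize LMCache usage.", docs))
```

Splitting on the first "=" only keeps paths that themselves contain "=" intact.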
lmcache query engine --url http://localhost:8000/v1 \
--prompt "{lmcache} Summarize LMCache usage." \
--format terminal \
--max-tokens 128
================= Query Engine =================
Model: facebook/opt-125m
Input tokens: 618
--------------- Latency Metrics ----------------
Output tokens: 9
TTFT (ms): 26.88
TPOT (ms/token): 0.91
Total latency (ms): 35.05
Throughput (tokens/s): 1100.64
================================================
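How the reported numbers relate to each other is not spelled out here, but the sample output is consistent with one plausible reading: TPOT is the time after the first token divided by the number of output tokens, and throughput is output tokens divided by that same decode time. The Python sketch below checks this reading against the sample values. The function name derive_metrics and both formulas are assumptions inferred from the numbers, not taken from the LMCache source.

```python
def derive_metrics(ttft_ms, total_ms, output_tokens):
    """Derive TPOT and decode throughput from the two measured latencies.

    Assumed definitions (inferred from the sample output, not confirmed):
      TPOT       = (total latency - TTFT) / output tokens
      throughput = output tokens / decode time in seconds
    """
    decode_ms = total_ms - ttft_ms
    if output_tokens <= 0 or decode_ms <= 0:
        return None, None  # nothing meaningful to report
    tpot_ms = decode_ms / output_tokens
    throughput = output_tokens / (decode_ms / 1000.0)
    return tpot_ms, throughput


# Values copied from the sample output above.
tpot, throughput = derive_metrics(ttft_ms=26.88, total_ms=35.05, output_tokens=9)
print(f"TPOT (ms/token): {tpot:.2f}")
print(f"Throughput (tokens/s): {throughput:.2f}")
```

With the rounded sample values this gives a TPOT of 0.91 ms/token and a throughput within about one token per second of the reported 1100.64. A gap of that size is what one would expect if the tool computes from unrounded timings, though that too is an inference.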
Options
| Flag | Required | Description |
|---|---|---|
| --url | Yes | Serving engine base URL (e.g. …). |
| --prompt | Yes | Prompt text with optional placeholders. |
| … | No | Model ID for the serving engine. Auto-detected from the engine’s reported usage if omitted. |
| --max-tokens | No | Maximum completion tokens (default: 128). |
| … | No | HTTP timeout in seconds (default: 30). |
| --documents NAME=PATH | No | Load file text for … |
| … | No | Use … |
| … | No | Try … |
| --format | No | Output format: … |
| … | No | Save metrics to a file (format follows …). |
| … | No | Suppress stdout output. Exit code only. |
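To make the link between these options and the usage-based token counts concrete, here is a minimal Python sketch of an OpenAI-compatible streaming request body and of reading the usage block from the streamed response. It assumes the completions-style request shape (a "prompt" field) and the standard server-sent-event framing used by OpenAI-compatible servers. The function names build_payload and read_usage are hypothetical, and the sketch says nothing about how lmcache itself is implemented.

```python
import json


def build_payload(prompt, model, max_tokens=128):
    """Request body for a streaming completion that also reports usage."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # mirrors the documented default of 128
        "stream": True,
        "stream_options": {"include_usage": True},
    }


def read_usage(sse_lines):
    """Return the last non-empty usage object found in the stream, or None."""
    usage = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


if __name__ == "__main__":
    # Canned stream standing in for a real HTTP response.
    stream = [
        'data: {"choices": [{"text": "LMCache"}], "usage": null}',
        "",
        'data: {"choices": [], "usage": {"prompt_tokens": 618, "completion_tokens": 9}}',
        "data: [DONE]",
    ]
    print(build_payload("Summarize LMCache usage.", "facebook/opt-125m"))
    print(read_usage(stream))
```

With include_usage set, OpenAI-compatible servers send the usage object in a final chunk whose choices list is empty, which is why the sketch keeps scanning until [DONE] instead of stopping at the last text chunk.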