# Observability

LMCache multiprocess mode provides three complementary observability signals: **metrics** (Prometheus counters via OTel), **logging** (Python logging with optional OTel log forwarding), and **tracing** (OTel spans for per-request latency). All three are powered by an internal EventBus that decouples producers (`L1Manager`, `StorageManager`, `MPCacheEngine`) from subscribers.
## Quick Start

By default, metrics and logging are enabled and tracing is disabled, so no extra flags are needed:

```shell
lmcache server \
    --l1-size-gb 100 --eviction-policy LRU
```

To enable tracing, supply an OTLP endpoint:

```shell
lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --enable-tracing --otlp-endpoint http://localhost:4317
```
## Configuration

| Argument | Default | Description |
|---|---|---|
| | off | Master switch: disable the EventBus entirely (no metrics, logging, or tracing subscribers are registered). |
| | off | Skip metrics subscribers (the Prometheus endpoint is not started). |
| | off | Skip logging subscribers. |
| `--enable-tracing` | off | Register tracing subscribers. Requires `--otlp-endpoint`. |
| | | Maximum number of events in the EventBus queue before tail-drop. |
| `--otlp-endpoint` | (none) | OTLP gRPC endpoint (e.g. `http://localhost:4317`). |
| | 9090 | Port for the Prometheus `/metrics` endpoint. |
| | | Fraction of chunks/blocks to track for lifecycle histograms, in (0, 1.0]. Counters always count all events. Default is 1%. |
| `--trace-level` | (none) | Enable trace recording at the given level. Currently only `storage` is supported. |
| `--trace-output` | (none) | Path to write the trace file. If omitted while `--trace-level` is set, a timestamped file under `$TMPDIR` is used. |
Environment variables:

| Variable | Default | Description |
|---|---|---|
| `LMCACHE_LOG_LEVEL` | | Controls the log level for all LMCache loggers. Valid values are the standard Python logging level names (e.g. `DEBUG`, `INFO`). |
## Metrics

Metrics are collected via OpenTelemetry counters and exported through an in-process Prometheus `/metrics` HTTP endpoint (default port 9090). When `--otlp-endpoint` is set, metrics are also pushed to the OTel collector.

All metrics use the `lmcache_mp.` prefix (multiprocess). On Prometheus, dots are converted to underscores and counters get a `_total` suffix (e.g. `lmcache_mp_l1_read_keys_total`).
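This dot-to-underscore mapping can be sketched as a small helper (illustrative only, not LMCache's exporter code; the function name is hypothetical):

```python
def prometheus_name(otel_name: str, is_counter: bool = True) -> str:
    """Map an OTel metric name to its Prometheus form: dots become
    underscores, and counters gain a ``_total`` suffix."""
    name = otel_name.replace(".", "_")
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name

print(prometheus_name("lmcache_mp.l1.read_keys"))  # lmcache_mp_l1_read_keys_total
```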
### StorageManager Metrics

| Metric | Type | Description |
|---|---|---|
| | Counter | Number of read (prefetch) requests received by the StorageManager. |
| | Counter | Number of keys successfully read from LMCache. |
| | Counter | Number of keys that failed to read. |
| | Counter | Number of write (reserve) requests. |
| | Counter | Number of keys successfully reserved for write. |
| | Counter | Number of keys that failed to reserve (OOM, write conflict). |
### L1 Metrics

| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.l1.read_keys` | Counter | Number of keys read from L1. |
| | Counter | Number of keys written to L1. |
| | Counter | Number of keys evicted by the EvictionController. |
### L1 Chunk Lifecycle Histograms

Sampled (default 1%) chunk-level lifecycle tracking via `L1LifecycleSubscriber`. Only sampled chunks contribute to histograms; the counters above always count all events. Sampling is deterministic (hash-based), so the same key always gets the same decision, with zero memory overhead.
| Metric | Type | Description |
|---|---|---|
| | Histogram | Time from allocation to eviction per sampled chunk. |
| | Histogram | Time from last access to eviction per sampled chunk. |
| | Histogram | Time gap between consecutive touches (reads or writes) of the same chunk. |
| | Histogram | Time from eviction to next reuse (capped at 300 s). |
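The deterministic, hash-based sampling used for these histograms can be sketched as follows. The actual hash used by `L1LifecycleSubscriber` is not specified here, so `sha256` and the function name are assumptions:

```python
import hashlib

SAMPLE_RATE = 0.01  # default: 1% of chunks

def is_sampled(key: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic, stateless sampling: hash the key and compare a
    bucket against the rate. The same key always gets the same
    decision, so no per-key state needs to be stored."""
    digest = hashlib.sha256(key.encode()).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```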
### L2 Metrics

| Metric | Type | Description |
|---|---|---|
| | Counter | Number of L2 store tasks submitted. |
| | Counter | Number of keys submitted for L2 store. |
| | Counter | Number of L2 store tasks completed. |
| | Counter | Number of keys successfully stored to L2. |
| | Counter | Number of keys that failed to store to L2. |
| | Counter | Number of L2 prefetch lookup requests. |
| | Counter | Number of keys submitted for L2 prefetch lookup. |
| | Counter | Number of prefix keys found in L2 lookup. |
| | Counter | Number of L2 prefetch load tasks submitted. |
| | Counter | Number of keys submitted for L2 load. |
| | Counter | Number of keys successfully loaded from L2. |
| | Counter | Number of keys that failed to load from L2. |
### L0 (GPU) Block Lifecycle Histograms

Sampled (default 1%) GPU KV cache block lifecycle tracking via `L0LifecycleSubscriber`. Eviction is detected at reallocation time (when a block is assigned different tokens). Sampling uses random selection with a `_skipped` set (bounded by the finite number of physical GPU blocks).

All L0 histograms are emitted with `instance_id` and `model_name` OTel attributes, enabling per-instance and per-model metric slicing in Prometheus (e.g. `lmcache_mp_l0_block_lifetime_seconds{instance_id="12345",model_name="llama-7b"}`).
| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.l0.block_lifetime_seconds` | Histogram | Time from allocation to eviction per sampled GPU block. |
| | Histogram | Time from last access to eviction per sampled GPU block. |
| | Histogram | Time gaps between consecutive accesses of the same GPU block. |
### Observable Gauges

Point-in-time state snapshots registered via `register_gauge` (pull-based OTel observable gauges).

| Metric | Type | Description |
|---|---|---|
| | ObservableGauge | Number of prefetch jobs currently in flight. A sustained high value may indicate slow L2 backends or polling delays. |
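The pull-based gauge pattern can be illustrated with a dependency-free sketch. `register_gauge` is the name used above, but this registry class and the metric name are hypothetical, not LMCache's actual OTel wiring:

```python
from typing import Callable, Dict

class GaugeRegistry:
    """Minimal sketch of pull-based gauges: callbacks registered via
    ``register_gauge`` are invoked at scrape time, so the metric
    always reflects current state instead of accumulated events."""

    def __init__(self) -> None:
        self._gauges: Dict[str, Callable[[], float]] = {}

    def register_gauge(self, name: str, callback: Callable[[], float]) -> None:
        self._gauges[name] = callback

    def collect(self) -> Dict[str, float]:
        # Called by the exporter on every scrape.
        return {name: cb() for name, cb in self._gauges.items()}

# Example: expose the number of in-flight prefetch jobs.
# (Hypothetical metric name -- the doc does not specify it.)
inflight_jobs = ["job-1", "job-2"]
registry = GaugeRegistry()
registry.register_gauge("lmcache_mp.prefetch.inflight", lambda: len(inflight_jobs))
```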
### Prometheus Scrape Configuration

Add the LMCache server as a Prometheus scrape target:

```yaml
scrape_configs:
  - job_name: "lmcache-mp"
    static_configs:
      - targets: ["<lmcache-host>:9090"]
```
## Logging

Logging subscribers emit debug-level messages for store, retrieve, lookup, L1, and StorageManager events via Python's standard `logging` module.

When OpenTelemetry is installed, `init_logger` automatically attaches an OTel `LoggingHandler` so that log records are forwarded to any configured OTel `LoggerProvider`. The handler respects the `LMCACHE_LOG_LEVEL` environment variable:

```shell
LMCACHE_LOG_LEVEL=DEBUG lmcache server ...
```
Key log messages:

| Level | Message |
|---|---|
| INFO | |
| INFO | |
| INFO | |
| DEBUG | |
| DEBUG | |
## Tracing

Note

`--enable-tracing` requires `--otlp-endpoint` to be set. The server will refuse to start if tracing is enabled without an OTLP endpoint, since there is no local fallback for trace export.

When tracing is enabled (`--enable-tracing --otlp-endpoint <URL>`), the tracing subscriber creates OTel spans from START/END event pairs:

- `mp.store` — from `MP_STORE_START` to `MP_STORE_END`
- `mp.retrieve` — from `MP_RETRIEVE_START` to `MP_RETRIEVE_END`
- `mp.lookup_prefetch` — from `MP_LOOKUP_PREFETCH_START` to `MP_LOOKUP_PREFETCH_END`

Each span carries event metadata as span attributes (e.g. `device`, `stored_count`, `found_count`).
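The START/END pairing can be sketched as follows. Only the span and event names come from the mapping above; the event payload shape and class name are assumptions:

```python
class SpanBuilder:
    """Sketch of turning START/END event pairs into spans: stash the
    START timestamp per request id, then emit a (name, duration,
    attributes) tuple when the matching END event arrives."""

    def __init__(self) -> None:
        self._open: dict = {}  # request_id -> (span_name, start_time)

    def on_event(self, name: str, request_id: str, attrs: dict, ts: float):
        if name.endswith("_START"):
            # MP_STORE_START -> span name "mp.store", etc.
            span_name = "mp." + name[len("MP_"):-len("_START")].lower()
            self._open[request_id] = (span_name, ts)
            return None
        # Matching END event: close the span and attach the END
        # event's metadata as span attributes.
        span_name, start = self._open.pop(request_id)
        return (span_name, ts - start, attrs)

b = SpanBuilder()
b.on_event("MP_STORE_START", "req-1", {}, 10.0)
span = b.on_event("MP_STORE_END", "req-1", {"stored_count": 8}, 10.5)
# span == ("mp.store", 0.5, {"stored_count": 8})
```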
View traces in any OTel-compatible backend such as Jaeger or Grafana Tempo:

```shell
# Start Jaeger all-in-one (OTLP gRPC on 4317)
docker run -d --name jaeger \
    -p 16686:16686 -p 4317:4317 \
    jaegertracing/all-in-one:latest

# Start LMCache with tracing
lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --enable-tracing --otlp-endpoint http://localhost:4317
```
## Trace Recording

Note

Trace recording is distinct from `--enable-tracing` (OTel spans). Trace recording captures every StorageManager public-API call to a binary file so the same workload can be replayed later for testing, regression hunting, and benchmarking, without needing vLLM and (eventually) without a GPU. `--enable-tracing` exports live OTel spans to an OTLP endpoint for online observability. The two features are independent and can be used together.
When `--trace-level storage` is set, LMCache records every call to `StorageManager.{reserve_write, finish_write, submit_prefetch_task, read_prefetched_results, finish_read_prefetched}` to a binary file for later replay.

Recording is off by default and adds near-zero overhead when off (a single boolean check per StorageManager call). When on, recording happens on the EventBus drain thread, off the request path.
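A sketch of that cheap on/off gate, with illustrative names (not LMCache's internals):

```python
class TraceRecorder:
    """Sketch of the near-zero-overhead toggle: every StorageManager
    call performs a single boolean check; when recording is on, the
    call is enqueued and serialized later on the EventBus drain
    thread, off the request path."""

    def __init__(self, enabled: bool, queue: list) -> None:
        self.enabled = enabled
        self._queue = queue  # drained by the EventBus thread

    def record(self, qualname: str, args: dict) -> None:
        if not self.enabled:  # the only cost when recording is off
            return
        self._queue.append((qualname, args))  # cheap enqueue; file I/O happens later
```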
### Capturing a trace

With an explicit output path:

```shell
lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --trace-level storage --trace-output /tmp/run.lct
```

With an implicit timestamped output path under `$TMPDIR`:

```shell
lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --trace-level storage
# → INFO log: "trace recording enabled (level=storage); no
#   --trace-output given, writing to
#   /tmp/lmcache-trace-<pid>-<UTC>.lct"
```
The trace file is closed cleanly on shutdown (SIGTERM is handled by the EventBus stop path).
### Replay

Replaying a recorded trace is delivered separately via the `lmcache trace` and `lmcache bench trace-replay` CLIs.
### What is captured (and what is not)

Captured:

- The fully-qualified name of every decorated `StorageManager` call.
- Each call's input arguments (e.g. `keys`, `layout_desc`, `mode`, `extra_count`, `external_request_id`).
- Wall-clock and monotonic timestamps of each call.
- A header carrying the `lmcache` version, start times, and a SHA-256 digest of the active `StorageManagerConfig` so replay can detect mismatched configurations.

Not captured:

- KV tensor bytes. Replay exercises bookkeeping and controller logic; payloads at replay time are zeros.
- Calls inside the `MPCacheEngine`, the message queue, or any GPU-copy code. These layers are out of scope for the `storage` trace level.
### File format

A length-prefixed msgpack stream:

```text
[4-byte big-endian length][msgpack Header]
[4-byte big-endian length][msgpack Record]
[4-byte big-endian length][msgpack Record]
...
```

The Header carries a magic prefix (`LMCT`), a format version, the trace level (`storage` today), the LMCache version, start timestamps, and the `StorageManagerConfig` digest. Each Record carries a relative timestamp, a wall-clock timestamp, the fully-qualified call site (`qualname`), and an argument dict.

The format is deliberately extensible: future trace levels (`mq`, `gpu`) will share this layout and use the `level` header field to discriminate. Additional captured ops add new `qualname` strings without bumping the format version.
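The framing can be illustrated with a dependency-free sketch. JSON stands in for msgpack here purely to keep the example self-contained; the field names mirror the description above:

```python
import io
import json
import struct

def write_frame(buf: io.BytesIO, obj: dict) -> None:
    """Write one length-prefixed frame: 4-byte big-endian length,
    then the encoded payload (msgpack in the real format)."""
    payload = json.dumps(obj).encode()
    buf.write(struct.pack(">I", len(payload)))
    buf.write(payload)

def read_frames(buf: io.BytesIO):
    """Yield decoded frames until the stream is exhausted."""
    while True:
        prefix = buf.read(4)
        if len(prefix) < 4:
            return
        (length,) = struct.unpack(">I", prefix)
        yield json.loads(buf.read(length))

buf = io.BytesIO()
# First frame: the Header; subsequent frames: one Record per call.
write_frame(buf, {"magic": "LMCT", "version": 1, "level": "storage"})
write_frame(buf, {"qualname": "StorageManager.reserve_write", "args": {}})
buf.seek(0)
frames = list(read_frames(buf))
# frames[0]["magic"] == "LMCT"
```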
For the full design rationale, see `docs/design/v1/mp_observability/trace.md` in the source tree.