Observability#
LMCache multiprocess mode provides three complementary observability modes: metrics (Prometheus counters via OTel), logging (Python logging with optional OTel log forwarding), and tracing (OTel spans for per-request latency).
All three modes are powered by an internal EventBus that decouples producers (L1Manager, StorageManager, MPCacheEngine) from subscribers.
Quick Start#
By default, metrics and logging are enabled; tracing is disabled. No extra flags are needed:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU
To enable tracing, supply an OTLP endpoint:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--enable-tracing --otlp-endpoint http://localhost:4317
Configuration#
| Argument | Default | Description |
|---|---|---|
| | off | Master switch: disable the EventBus entirely (no metrics, logging, or tracing subscribers are registered). |
| | off | Skip metrics subscribers (the Prometheus endpoint is not started). |
| | off | Skip logging subscribers. |
| `--enable-tracing` | off | Register tracing subscribers. Requires `--otlp-endpoint`. |
| `--event-bus-queue-size` | | Maximum number of events in the EventBus queue before tail-drop. |
| `--otlp-endpoint` | (none) | OTLP gRPC endpoint (e.g. `http://localhost:4317`). |
| | 9090 | Port for the Prometheus `/metrics` endpoint. |
| | (unset, default UUID v4) | Identifier for this MP server instance. Attached as an OTel Resource attribute. |
| | | Fraction of chunks/blocks to track for lifecycle histograms, in (0, 1.0]. Counters always count all events. Default is 1%. |
| `--trace-level` | (none) | Enable trace recording at the given level. Currently only `storage` is supported. |
| `--trace-output` | (none) | Path to write the trace file. If omitted while `--trace-level` is set, a timestamped file under `$TMPDIR` is used. |
Environment variables:
| Variable | Default | Description |
|---|---|---|
| `LMCACHE_LOG_LEVEL` | | Controls the log level for all LMCache loggers (valid values include `DEBUG` and `INFO`). |
Metrics#
Metrics are collected via OpenTelemetry counters and exported through an
in-process Prometheus /metrics HTTP endpoint (default port 9090).
When --otlp-endpoint is set, metrics are also pushed to the OTel
collector.
All metrics use the lmcache_mp. prefix (multiprocess). On Prometheus,
dots are converted to underscores and counters get a _total suffix
(e.g. lmcache_mp_l1_read_keys_total).
Global Resource Attributes#
Every metric and span exported by an MP server carries Resource-level
attributes built at startup. These identify the process producing the
telemetry and are orthogonal to per-metric attributes such as
cache_salt.
| Attribute | CLI flag / config | Default when unset |
|---|---|---|
| | | Random UUID v4 minted at startup. |
Resource attributes attach to the MeterProvider / TracerProvider
and propagate to every exported datapoint via OTLP. On Prometheus, SDK
resource attributes surface on the target_info series rather than
on each time-series — this is standard OTel behavior.
StorageManager Metrics#
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of read (prefetch) requests received by the StorageManager. |
| | Counter | Number of keys successfully read from LMCache. |
| | Counter | Number of keys that failed to read. |
| | Counter | Number of write (reserve) requests. |
| | Counter | Number of keys successfully reserved for write. |
| | Counter | Number of keys that failed to reserve (OOM, write conflict). |
L1 Metrics#
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of keys read from L1. |
| | Counter | Number of keys written to L1. |
| | Counter | Number of keys evicted by the EvictionController. |
| | Counter | L1 eviction-loop iterations (every cycle, regardless of whether the watermark was crossed). |
| | Counter | L1 eviction-loop iterations where the watermark was crossed. |
L1 Chunk Lifecycle Histograms#
Sampled (default 1%) chunk-level lifecycle tracking via
L1LifecycleSubscriber. Only sampled chunks contribute to histograms;
counters above always count all events. Sampling is deterministic
(hash-based), so the same key always gets the same decision with zero
memory overhead.
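For intuition, a deterministic, hash-based sampling gate can be as simple as the sketch below (the helper is hypothetical, not LMCache's actual implementation):

```python
import hashlib

def is_sampled(key: str, sample_rate: float = 0.01) -> bool:
    """Deterministically decide whether a chunk key is sampled.

    The same key always hashes to the same bucket, so the decision is stable
    across reads, writes, and evictions and no per-key state is remembered.
    """
    digest = hashlib.sha256(key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / float(1 << 64)  # uniform in [0, 1)
    return bucket < sample_rate

# Only sampled chunks contribute to lifecycle histograms;
# counters are incremented for every chunk regardless.
if is_sampled("chunk:model=llama-7b:hash=abc123"):
    pass  # record histogram samples for this chunk
```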
| Metric | Type | Description |
|---|---|---|
| | Histogram | Time from allocation to eviction per sampled chunk. |
| | Histogram | Time from last access to eviction per sampled chunk. |
| | Histogram | Time gap between consecutive touches (read or write) of the same chunk. |
| | Histogram | Time from eviction to next reuse (capped at 300 s). |
StorageManager Real-Reuse Metrics#
Workload-level reuse histograms emitted by SMLifecycleSubscriber,
driven by caller-facing StorageManager events
(SM_READ_PREFETCHED_FINISHED, SM_WRITE_FINISHED). Internal
read-lock releases by the store/prefetch controllers are excluded so
the signal reflects user-driven access only.
Both histograms are tagged with cache_salt for per-tenant
isolation. The per-salt access counter advances on every read and
write of every chunk (regardless of sampling) so the chunks-gap
reflects true storage volume; the histogram itself records gaps only
for chunks that pass the (deterministic, hash-based) sampling gate.
| Metric | Type | Description |
|---|---|---|
| | Histogram (tag: `cache_salt`) | Time gap between a chunk's last access (read or write) and its next read. Captures storage cost — how long a stored chunk sat between accesses. Emitted only on read events. |
| | Histogram (tag: `cache_salt`) | Per-`cache_salt` chunks-gap: gap between a chunk's consecutive accesses measured in chunk accesses under the same salt (see the access-counter note above). |
L2 Metrics#
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of L2 store tasks submitted. |
| | Counter | Number of keys submitted for L2 store. |
| `lmcache_mp.l2_store_completed` | Counter (attr: `l2_name`) | Number of L2 store tasks completed, labeled by adapter type. |
| | Counter | Number of keys successfully stored to L2. |
| | Counter | Number of keys that failed to store to L2. |
| | Counter | Number of L2 prefetch lookup requests. |
| | Counter | Number of keys submitted for L2 prefetch lookup. |
| | Counter | Number of prefix keys found in L2 lookup. |
| | Counter | Number of L2 prefetch load tasks submitted. |
| | Counter | Number of keys submitted for L2 load. |
| | Counter | Number of keys successfully loaded from L2. |
| | Counter | Number of keys that failed to load from L2. |
| `lmcache_mp.l2_load_completed` | Counter (attr: `l2_name`) | Number of per-adapter L2 load tasks completed, labeled by adapter type. |
The l2_name-labeled counters (l2_store_completed and
l2_load_completed) exist so dashboards can compute per-backend IOPS on
demand via rate(lmcache_mp_l2_store_completed_total{l2_name="..."}[1m])
(and the equivalent for loads). No separate *_iops metric is exported;
keeping the raw counter lets dashboard users pick their own window.
Failure & Health Counters#
Health-monitoring counters emitted on the dedicated lmcache_mp.health
OTel meter. Driven by the L1FailureMetricsSubscriber and
L2FailureMetricsSubscriber, which are registered automatically when
metrics are enabled. All three counters carry model_name (extracted
from each ObjectKey) so operators can slice per-model on the
Prometheus /metrics endpoint.
| Metric | Type | Description |
|---|---|---|
| | Counter | L1 memory allocation failures (OOM). |
| | Counter | L1 |
| `l2_prefetch_failure` | Counter | Keys that L2 reported present at lookup but failed to land in L1. Tagged by `model_name`. |
A reason=serde_failure value will be added to l2_prefetch_failure
as an additive, non-breaking extension once L2 adapters distinguish
deserialization errors from missing objects — no dashboard migration
needed when that lands.
For the full design rationale (including which event types drive each
counter and why lmcache_instance_id is deferred), see
docs/design/v1/mp_observability/METRICS.md in the source tree.
Lookup Hit-Rate Metrics#
Token-level counters whose ratio gives the fraction of tokens requested by a lookup that were served from either L1 or L2. L0 (GPU prefix cache) is intentionally excluded — it is vLLM-owned and not observable from LMCache.
| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.lookup_requested_tokens` | Counter (attrs: `model_name`, `cache_salt`) | Total tokens submitted for lookup (denominator of the L1+L2 token-level hit rate). Only chunk-aligned tokens are counted. |
| `lmcache_mp.lookup_hit_tokens` | Counter (attrs: `model_name`, `cache_salt`) | Total tokens found in L1 or L2 during lookup (numerator of the L1+L2 token-level hit rate). Counts the contiguous prefix hit only. |
Both counters are driven by the same event (MP_LOOKUP_PREFETCH_END),
so they always advance together per completed lookup. Early-exit lookups
contribute 0 to both, and abandoned lookups contribute to neither.
The model_name and cache_salt attributes are captured at lookup
time from IPCCacheEngineKey so dashboards can compute per-model or
per-tenant hit rate. cache_salt can be high-cardinality (one entry
per tenant or isolation domain); drop it at scrape time with
metric_relabel_configs if storage cost matters.
PromQL for hit rate:
# Aggregate (all models, all salts):
rate(lmcache_mp_lookup_hit_tokens_total[5m])
/ rate(lmcache_mp_lookup_requested_tokens_total[5m])
# Per-model:
sum(rate(lmcache_mp_lookup_hit_tokens_total[5m])) by (model_name)
/ sum(rate(lmcache_mp_lookup_requested_tokens_total[5m])) by (model_name)
L0 (GPU) Block Lifecycle Histograms#
Sampled (default 1%) GPU KV cache block lifecycle tracking via
L0LifecycleSubscriber. Eviction is detected at reallocation time
(when a block is assigned different tokens). Sampling uses random
selection with a _skipped set (bounded by the finite number of
physical GPU blocks).
All L0 histograms are emitted with instance_id and model_name
OTel attributes, enabling per-instance and per-model metric slicing
in Prometheus (e.g.
lmcache_mp_l0_block_lifetime_seconds{instance_id="12345",model_name="llama-7b"}).
| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.l0_block_lifetime_seconds` | Histogram | Time from allocation to eviction per sampled GPU block. |
| | Histogram | Time from last access to eviction per sampled GPU block. |
| | Histogram | Time gaps between consecutive accesses of the same GPU block. |
L0 ↔ L1 Throughput Histograms#
Sampled (default 1%) per-request throughput of GPU↔CPU copies via
L0L1ThroughputSubscriber. Each sampled request contributes one sample
to the appropriate histogram: total_bytes / (end_ts - start_ts) in
GB/s. Timestamps come from MP_{STORE,RETRIEVE}_{START,END} events
published on the GPU cupy stream, so they reflect true GPU-stream copy
time — not Python/lock overhead.
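As a rough sketch of the arithmetic behind each sample (using illustrative numbers, not actual event payloads):

```python
def throughput_gbs(total_bytes: int, start_ts: float, end_ts: float) -> float:
    """One histogram sample: bytes copied divided by GPU-stream copy time, in GB/s."""
    duration_s = end_ts - start_ts          # timestamps from the START/END event pair
    return total_bytes / duration_s / 1e9   # decimal GB per second

# e.g. a sampled store request that moved 2 GiB in 80 ms of GPU-stream time:
sample = throughput_gbs(total_bytes=2 * 1024**3, start_ts=0.000, end_ts=0.080)
# sample ≈ 26.8, recorded as one observation in the store-throughput histogram
```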
All throughput histograms are emitted with engine_id (vLLM worker
instance id), device (e.g. "cuda:3"), and model_name OTel
attributes, enabling per-worker, per-device, and per-model slicing in
Prometheus (e.g.
lmcache_mp_l0_l1_store_throughput_gbs{engine_id="0",device="cuda:3",model_name="meta-llama/Llama-3.1-8B"}).
| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.l0_l1_store_throughput_gbs` | Histogram | GPU→CPU (L0→L1) store throughput in GB/s per sampled request. |
| | Histogram | CPU→GPU (L1→L0) load throughput in GB/s per sampled request. |
L1 ↔ L2 Throughput Histograms#
Sampled (default 1%) per-task throughput of L1↔L2 transfers via
L2ThroughputSubscriber. The store path correlates
L2_STORE_SUBMITTED → L2_STORE_COMPLETED by
(adapter_index, task_id). The load path correlates the per-adapter
L2_LOAD_TASK_SUBMITTED → L2_LOAD_TASK_COMPLETED events by
(request_id, adapter_index); the request-level
L2_PREFETCH_LOAD_* events used by the key-count counters aggregate
across adapters and cannot be attributed to a specific l2_name.
Timestamps span submit → complete, so the duration includes adapter queue, network, and disk I/O — the value is bytes / end-to-end latency, not raw transfer rate. Use these histograms to compare adapter types and catch regressions; use the L0↔L1 histograms when you need pure copy-time throughput.
All L1↔L2 throughput histograms carry a single l2_name OTel
attribute — the registered adapter type (e.g. "fs", "nixl_store",
"mooncake_store") — enabling per-backend slicing in Prometheus (e.g.
lmcache_mp_l2_store_throughput_gbs{l2_name="nixl_store"}).
| Metric | Type | Description |
|---|---|---|
| `lmcache_mp.l2_store_throughput_gbs` | Histogram | L1→L2 store throughput in GB/s per sampled task. |
| | Histogram | L2→L1 load throughput in GB/s per sampled (request, adapter) pair. |
Engine Counters#
Worker-scoped counters tied to what the MP server delivers back to each
vLLM worker via retrieve(). Labeled by worker_id (the vLLM
worker instance id) — distinct from any scheduler-scoped id that may
appear on other metrics.
| Metric | Type | Description |
|---|---|---|
| | Counter (attrs: `worker_id`) | Total number of LMCache chunks loaded into the engine, summed over all `retrieve()` calls. |
Observable Gauges#
Point-in-time state snapshots registered via register_gauge
(pull-based OTel observable gauges).
The three in-flight metrics carry two attributes that distinguish
adapters even when more than one is registered with the same backend
type — same shape as lmcache_mp.l2_store_completed:
- `l2_name` — the registered adapter type (e.g. `"fs"`, `"nixl_store"`, `"mooncake_store"`).
- `adapter_index` — position in the controller's adapter list.
Adapters with no in-flight work emit no datapoint for that scrape.
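For reference, this is roughly how a pull-based observable gauge with per-adapter attributes is registered with the OpenTelemetry Python SDK; the gauge name, attribute values, and in-flight bookkeeping below are illustrative, not LMCache's internal register_gauge API:

```python
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("lmcache_mp")

# Hypothetical source of in-flight counts, keyed by (l2_name, adapter_index).
in_flight = {("fs", 0): 3, ("nixl_store", 1): 0}

def observe_in_flight(options: CallbackOptions):
    for (l2_name, adapter_index), count in in_flight.items():
        if count == 0:
            continue  # adapters with no in-flight work emit no datapoint
        yield Observation(count, {"l2_name": l2_name, "adapter_index": adapter_index})

meter.create_observable_gauge(
    "l2_store_tasks_in_flight",  # hypothetical name, for illustration only
    callbacks=[observe_in_flight],
    description="L2 store tasks currently executing, per adapter",
)
```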
| Metric | Type | Description |
|---|---|---|
| | ObservableGauge | Number of prefetch jobs currently in-flight. A sustained high value may indicate slow L2 backends or polling delays. |
| | ObservableGauge | Bytes currently held in L1. Rising without plateauing typically indicates a leak; saturating at the configured `--l1-size-gb` limit is expected. |
| | ObservableGauge | L1 used/total ratio. |
| | ObservableGauge (attrs: `l2_name`, `adapter_index`) | L2 store tasks currently executing, per adapter. Sustained non-zero values indicate the adapter cannot keep up with the L1 → L2 write rate. |
| | ObservableGauge (attrs: `l2_name`, `adapter_index`) | L2 → L1 prefetch load tasks currently executing, per adapter. |
| | ObservableGauge (attrs: `l2_name`, `adapter_index`) | L1 bytes reserved by in-flight L2 → L1 prefetch loads, per adapter. |
EventBus Self-Monitoring#
Health metrics for the EventBus itself, registered by
EventBusSelfMetricsSubscriber on the lmcache.event_bus OTel
meter. These metrics observe bus state directly via the EventBus
accessors and report on every OTel scrape — they are not driven by
events, so dropping or failing subscribers cannot silence them.
Use them to answer: is the EventBus keeping up with publishers, is
anything being dropped, and are any subscriber callbacks raising?
A non-zero dropped_events_total or a sustained non-zero
drain_lag_seconds indicates the bus is at --event-bus-queue-size
and tail-dropping; raise that flag or investigate slow subscribers.
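A minimal sketch of the semantics these metrics observe (the class and method names below are illustrative, not the real EventBus API):

```python
import time
from collections import deque

class TinyEventBus:
    """Bounded queue with tail-drop, mirroring what the self-monitoring metrics report."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.queue: deque = deque()   # items are (publish_ts, event)
        self.dropped_events = 0       # feeds the cumulative dropped-events counter

    def publish(self, event) -> None:
        if len(self.queue) >= self.max_size:
            self.dropped_events += 1  # tail-drop: the newest event is discarded
            return
        self.queue.append((time.monotonic(), event))

    def queue_depth(self) -> int:
        return len(self.queue)        # feeds the queue-depth gauge

    def drain_lag_seconds(self) -> float:
        if not self.queue:
            return 0.0
        oldest_publish_ts, _ = self.queue[0]
        return time.monotonic() - oldest_publish_ts  # feeds the drain-lag gauge
```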
| Metric | Type | Description |
|---|---|---|
| | ObservableGauge | Events currently queued in the EventBus. |
| `drain_lag_seconds` | ObservableGauge | Seconds since the oldest queued event was published. |
| `dropped_events` | ObservableCounter | Cumulative events dropped because the EventBus queue was at `--event-bus-queue-size` capacity. |
| | ObservableCounter | Cumulative exceptions raised by subscriber callbacks during EventBus dispatch. |
For the full design rationale and the in-process accessors that back
each metric see docs/design/v1/mp_observability/METRICS.md and
docs/design/v1/mp_observability/event-bus.md in the source tree.
Prometheus Scrape Configuration#
Add the LMCache server as a Prometheus scrape target:
scrape_configs:
- job_name: "lmcache-mp"
static_configs:
- targets: ["<lmcache-host>:9090"]
Logging#
Logging subscribers emit debug-level messages for store, retrieve, lookup,
L1, and StorageManager events via Python’s standard logging module.
When OpenTelemetry is installed, init_logger automatically attaches an
OTel LoggingHandler so that log records are forwarded to any configured
OTel LoggerProvider. The handler respects the LMCACHE_LOG_LEVEL
environment variable.
LMCACHE_LOG_LEVEL=DEBUG lmcache server ...
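Conceptually, the level handling behaves like the simplified sketch below (not the actual init_logger implementation; the fallback level here is an assumption):

```python
import logging
import os

def get_lmcache_logger(name: str) -> logging.Logger:
    """Sketch of an env-driven logger setup: the level comes from LMCACHE_LOG_LEVEL."""
    level = os.environ.get("LMCACHE_LOG_LEVEL", "INFO").upper()  # fallback assumed for this sketch
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    # When the OpenTelemetry SDK is installed, an OTel LoggingHandler can be
    # attached here as well so records reach a configured LoggerProvider.
    return logger

log = get_lmcache_logger("lmcache.observability")
log.debug("store finished")  # emitted only when LMCACHE_LOG_LEVEL=DEBUG
```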
Key log messages:
| Level | Message |
|---|---|
| INFO | |
| INFO | |
| INFO | |
| DEBUG | |
| DEBUG | |
Tracing#
Note
--enable-tracing requires --otlp-endpoint to be set.
The server will refuse to start if tracing is enabled without an
OTLP endpoint, since there is no local fallback for trace export.
When tracing is enabled (--enable-tracing --otlp-endpoint <URL>),
the tracing subscriber creates OTel spans from START/END event pairs:
- `mp.store` — from `MP_STORE_START` to `MP_STORE_END`
- `mp.retrieve` — from `MP_RETRIEVE_START` to `MP_RETRIEVE_END`
- `mp.lookup_prefetch` — from `MP_LOOKUP_PREFETCH_START` to `MP_LOOKUP_PREFETCH_END`
Each span carries event metadata as span attributes (e.g. device,
stored_count, found_count).
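As an illustration of how a START/END pair becomes a span, the snippet below uses the public OpenTelemetry API with made-up timestamps; the real subscriber wiring differs:

```python
from opentelemetry import trace

tracer = trace.get_tracer("lmcache_mp")

def emit_span_from_events(name: str, start_ns: int, end_ns: int, attrs: dict) -> None:
    """Create a span retroactively from recorded event timestamps (nanoseconds)."""
    span = tracer.start_span(name, start_time=start_ns, attributes=attrs)
    span.end(end_time=end_ns)

# e.g. an MP_STORE_START / MP_STORE_END pair observed on the EventBus:
emit_span_from_events(
    "mp.store",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_000_150_000_000,
    attrs={"device": "cuda:3", "stored_count": 16},
)
```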
View traces in any OTel-compatible backend such as Jaeger or Grafana Tempo.
# Start Jaeger all-in-one (OTLP gRPC on 4317)
docker run -d --name jaeger \
-p 16686:16686 -p 4317:4317 \
jaegertracing/all-in-one:latest
# Start LMCache with tracing
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--enable-tracing --otlp-endpoint http://localhost:4317
Per-Request Hit-Rate Attributes#
Each session is wrapped in a per-request root span — request for the
standard MP path and cb.request for the CacheBlend path — that nests
all child spans (mp.store, mp.retrieve, mp.lookup_prefetch)
beneath it. When the lookup phase ends, the root span is annotated with
three OTel attributes that summarise the request-level cache hit rate:
| Attribute | OTel type | Description |
|---|---|---|
| `hit_tokens` | | Tokens served from L1+L2 (numerator). |
| `requested_tokens` | | Chunk-aligned tokens submitted for lookup (denominator). |
| `hit_rate` | | `hit_tokens / requested_tokens` for the request. |
The attributes are written when MP_LOOKUP_PREFETCH_END (standard MP
path) or CB_LOOKUP_END (CacheBlend path) is processed — while the
root span is still open. Store-only requests that never call
lookup_prefetch_start() emit no end event for the lookup phase, so
their root span will not carry these attributes.
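The annotation step itself reduces to a few attribute writes on the still-open root span; a sketch (the root-span handle and numbers are illustrative):

```python
def annotate_hit_rate(root_span, hit_tokens: int, requested_tokens: int) -> None:
    """Write the request-level hit-rate summary onto the still-open root span."""
    root_span.set_attribute("hit_tokens", hit_tokens)
    root_span.set_attribute("requested_tokens", requested_tokens)
    if requested_tokens > 0:
        root_span.set_attribute("hit_rate", hit_tokens / requested_tokens)

# e.g. a lookup that found 384 of 512 chunk-aligned tokens yields hit_rate = 0.75
```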
Example TraceQL queries (Grafana Tempo):
# Requests with less than 50% cache hit rate
{ name = "request" && span.hit_rate < 0.5 }
# Full cache hits only
{ name = "request" && span.hit_rate = 1.0 }
# Complete misses (lookup ran but nothing was cached)
{ name = "request" && span.requested_tokens > 0 && span.hit_tokens = 0 }
For the full event-to-span mapping and the registry pattern that links
child spans back to the root see
docs/design/observability/request-event-span.md in the source tree.
Trace Recording#
Note
Trace recording is distinct from --enable-tracing (OTel
spans). Trace recording captures every StorageManager public-API
call to a binary file so the same workload can be replayed later
for testing, regression hunting, and benchmarking — without needing
vLLM and (eventually) without a GPU. --enable-tracing exports
live OTel spans to an OTLP endpoint for online observability.
The two features are independent and can be used together.
When --trace-level storage is set, LMCache records every call to
StorageManager.{reserve_write, finish_write, submit_prefetch_task,
read_prefetched_results, finish_read_prefetched} to a binary file
for later replay.
Recording is off by default and adds near-zero overhead when off
(a single boolean check per StorageManager call). When on,
recording happens on the EventBus drain thread, off the request path.
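Conceptually, the recording hook reduces to a guarded publish around each decorated call; the decorator below is a hypothetical sketch, not the real implementation:

```python
import functools
import time

def record_call(func):
    """Sketch of a trace-recording decorator: one cheap boolean check when recording is off."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if self.trace_recorder is not None:        # single check on the hot path when disabled
            self.trace_recorder.publish(           # handed to the EventBus drain thread;
                qualname=func.__qualname__,        # serialization happens off the request path
                wall_ts=time.time(),
                mono_ts=time.monotonic(),
                args=args,
                kwargs=kwargs,
            )
        return func(self, *args, **kwargs)
    return wrapper
```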
Capturing a trace#
With an explicit output path:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--trace-level storage --trace-output /tmp/run.lct
With an implicit timestamped output path under $TMPDIR:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--trace-level storage
# → INFO log: "trace recording enabled (level=storage); no
# --trace-output given, writing to
# /tmp/lmcache-trace-<pid>-<UTC>.lct"
The trace file is closed cleanly on shutdown (SIGTERM is handled by the EventBus stop path).
Replay#
Replaying a recorded trace, plus the full set of CLI flags for driving, monitoring, and exporting replay results, is covered in its own page: Tracing and Debugging.
What is captured (and what is not)#
Captured:
- The fully-qualified name of every decorated `StorageManager` call.
- Each call's input arguments (e.g. `keys`, `layout_desc`, `mode`, `extra_count`, `external_request_id`).
- Wall-clock and monotonic timestamps of each call.
- A header carrying a trace schema version, start times, and a SHA-256 digest of the active `StorageManagerConfig` so replay can detect mismatched configurations (a sketch of the digest check follows below).

Not captured:

- KV tensor bytes. Replay exercises bookkeeping and controller logic; payloads at replay time are zeros.
- Calls inside the `MPCacheEngine`, the message queue, or any GPU-copy code. These layers are out of scope for the `storage` trace level.
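A sketch of how such a configuration digest can be computed and compared at replay time (illustrative only; the real canonicalization of StorageManagerConfig may differ):

```python
import hashlib
import json

def config_digest(config: dict) -> str:
    """SHA-256 over a canonical JSON rendering of the configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

recorded = config_digest({"l1_size_gb": 100, "eviction_policy": "LRU"})
current  = config_digest({"l1_size_gb": 200, "eviction_policy": "LRU"})
if recorded != current:
    print("warning: replaying a trace against a different StorageManager configuration")
```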
File format#
A length-prefixed msgpack stream:
[4-byte big-endian length][msgpack Header]
[4-byte big-endian length][msgpack Record]
[4-byte big-endian length][msgpack Record]
...
The Header carries a magic prefix (LMCT), a format version,
the trace level (storage today), a trace schema version, start
timestamps, and the StorageManagerConfig digest. Each Record
carries a relative timestamp, a wall-clock timestamp, the
fully-qualified call site (qualname), and an argument dict.
The format is deliberately extensible: future trace levels
(mq, gpu) will share this layout and use the level header
field to discriminate. Additional captured ops add new qualname
strings without bumping the format version.
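A minimal reader for this framing, assuming the msgpack package is installed (field handling is illustrative; consult the trace module for the authoritative schema):

```python
import struct
import msgpack

def read_trace(path: str):
    """Yield the decoded Header followed by each Record from a length-prefixed stream."""
    with open(path, "rb") as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:                       # clean end of stream
                break
            (length,) = struct.unpack(">I", prefix)   # 4-byte big-endian length
            yield msgpack.unpackb(f.read(length), raw=False)

records = read_trace("/tmp/run.lct")
header = next(records)   # magic, version, level, timestamps, config digest
for record in records:
    ...                  # each record: timestamps, qualname, argument dict
```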
For the full design rationale see
docs/design/v1/mp_observability/trace.md in the source tree.