Metrics
Metrics are collected via OpenTelemetry instruments (counters, histograms,
and observable gauges) and exported through an in-process Prometheus
/metrics HTTP endpoint (default port 9090).
When --otlp-endpoint is set, metrics are also pushed to the OTel
collector.
All metrics use the lmcache_mp. prefix (multiprocess). On Prometheus,
dots are converted to underscores and counters get a _total suffix
(e.g. lmcache_mp_l1_read_chunks_total).
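The naming rule above can be sketched in a few lines. This is an illustration of the two conversions described here (dots to underscores, `_total` on counters), not the exporter's actual code; the helper name `prometheus_name` is hypothetical, and any other suffixes the exporter may add are not modeled.

```python
def prometheus_name(otel_name: str, is_counter: bool) -> str:
    """Map an OTel instrument name to the series name expected on /metrics.

    Only models the two rules stated in this section.
    """
    name = otel_name.replace(".", "_")
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_name("lmcache_mp.l1_read_chunks", is_counter=True))
# lmcache_mp_l1_read_chunks_total
```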
Global Resource Attributes
Every metric and span exported by an MP server carries Resource-level
attributes built at startup. These identify the process producing the
telemetry and are orthogonal to per-metric attributes such as
cache_salt.
| Attribute | CLI flag / config | Default when unset |
|---|---|---|
| | | Random UUID v4 minted at startup. |
Resource attributes attach to the MeterProvider / TracerProvider
and propagate to every exported datapoint via OTLP. On Prometheus, SDK
resource attributes surface on the target_info series rather than
on each time-series — this is standard OTel behavior.
L1 Metrics
| Metric | Type | Description |
|---|---|---|
| | Counter (attr: | Number of chunks read from L1, grouped by tenant. |
| | Counter (attr: | Number of chunks written to L1, grouped by tenant. |
| | Counter (attr: | Number of chunks evicted by the EvictionController, grouped by tenant. |
| | Counter | L1 eviction-loop iterations (every cycle, regardless of whether the watermark was crossed). Driven by |
| | Counter | L1 eviction-loop iterations where |
L1 Chunk Lifecycle Histograms
Sampled (default 1%) chunk-level lifecycle tracking via
L1LifecycleSubscriber. Only sampled chunks contribute to histograms;
counters above always count all events. Sampling is deterministic
(hash-based), so the same key always gets the same decision with zero
memory overhead.
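A minimal sketch of deterministic, hash-based sampling as described above. The hash function (BLAKE2b) and the helper name `is_sampled` are assumptions for illustration; the hash actually used by L1LifecycleSubscriber is not stated here.

```python
import hashlib


def is_sampled(key: str, sample_rate: float = 0.01) -> bool:
    """Return the same decision for the same key without storing any state.

    The key is hashed to a 64-bit integer and compared against a threshold
    proportional to sample_rate (assumed hash: BLAKE2b, 8-byte digest).
    """
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    threshold = int(sample_rate * 2**64)
    return int.from_bytes(digest, "big") < threshold


# The decision is a pure function of the key, so no per-key bookkeeping is needed.
assert is_sampled("chunk-42") == is_sampled("chunk-42")
```

Because the decision depends only on the key, every event for a sampled chunk (allocation, touch, eviction, reuse) is observed, which is what makes per-chunk lifetimes measurable at a 1% rate.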
| Metric | Type | Description |
|---|---|---|
| | Histogram | Time from allocation to eviction per sampled chunk. |
| | Histogram | Time from last access to eviction per sampled chunk. |
| | Histogram | Time gap between consecutive touches (read or write) of the same chunk. |
| | Histogram | Time from eviction to next reuse (capped at 300 s). |
StorageManager Real-Reuse Metrics
Workload-level reuse histograms emitted by SMLifecycleSubscriber,
driven by caller-facing StorageManager events
(SM_READ_PREFETCHED_FINISHED, SM_WRITE_FINISHED). Internal
read-lock releases by the store/prefetch controllers are excluded so
the signal reflects user-driven access only.
Both histograms are tagged with cache_salt for per-tenant
isolation. The per-salt access counter advances on every read and
write of every chunk (regardless of sampling) so the chunks-gap
reflects true storage volume; the histogram itself records gaps only
for chunks that pass the (deterministic, hash-based) sampling gate.
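The split between "counter advances on every access" and "histogram records only sampled chunks" can be sketched as follows. The class and method names are hypothetical, the histogram is replaced by a plain list, and recording the gap only on reads is an assumption carried over from the time-gap metric.

```python
from collections import defaultdict


class ChunkGapTracker:
    """Illustrative per-salt access counter with a sampling gate."""

    def __init__(self, is_sampled):
        self._is_sampled = is_sampled          # callable: chunk_key -> bool
        self._access_count = defaultdict(int)  # cache_salt -> accesses so far
        self._last_seen = {}                   # (salt, key) -> counter at last access
        self.recorded_gaps = []                # stand-in for the histogram

    def on_access(self, cache_salt, chunk_key, is_read):
        # The counter advances for every chunk, sampled or not.
        self._access_count[cache_salt] += 1
        now = self._access_count[cache_salt]
        if not self._is_sampled(chunk_key):
            return
        slot = (cache_salt, chunk_key)
        previous = self._last_seen.get(slot)
        if is_read and previous is not None:
            self.recorded_gaps.append((cache_salt, now - previous))
        self._last_seen[slot] = now


tracker = ChunkGapTracker(is_sampled=lambda key: key == "a")
tracker.on_access("tenant-1", "a", is_read=False)
tracker.on_access("tenant-1", "b", is_read=False)  # unsampled, still counted
tracker.on_access("tenant-1", "c", is_read=False)  # unsampled, still counted
tracker.on_access("tenant-1", "a", is_read=True)
print(tracker.recorded_gaps)  # [('tenant-1', 3)]
```

The gap of 3 includes the two unsampled chunks written in between, which is why the counter cannot be restricted to sampled keys.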
| Metric | Type | Description |
|---|---|---|
| | Histogram (tag: | Time gap between a chunk’s last access (read or write) and its next read. Captures storage cost: how long a stored chunk sat between accesses. Emitted only on read events. |
| | Histogram (tag: | Per- |
L2 Metrics
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of L2 store requests submitted. |
| | Counter (attr: | Number of chunks submitted for L2 store, grouped by tenant. |
| | Counter (attr: | Number of L2 store requests completed, labeled by adapter type. |
| | Counter (attr: | Number of chunks successfully stored to L2, grouped by tenant. |
| | Counter | Number of L2 prefetch lookup requests. |
| | Counter (attr: | Number of chunks submitted for L2 prefetch lookup, grouped by tenant. |
| | Counter | Number of prefix chunks found in L2 lookup. |
| | Counter | Number of L2 prefetch load requests submitted. |
| | Counter (attr: | Number of chunks submitted for L2 load, grouped by tenant. |
| | Counter (attr: | Number of chunks successfully loaded from L2, grouped by tenant. |
| | Counter (attr: | Number of per-adapter L2 load requests completed, labeled by adapter type. |
| | Counter (attr: | Number of chunks evicted from L2, grouped by tenant. |
The l2_name-labeled counters (l2_store_completed and
l2_load_completed) exist so dashboards can compute per-backend IOPS on
demand via rate(lmcache_mp_l2_store_completed_requests_total{l2_name="..."}[1m])
(and the equivalent for loads). No separate *_iops metric is exported;
keeping the raw counter lets dashboard users pick their own window.
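To make the "IOPS from a raw counter" idea concrete, here is a simplified sketch of what a rate over a window computes from scraped counter samples. It is not Prometheus's implementation (which also extrapolates to the window boundaries); the function name `counter_rate` is hypothetical.

```python
def counter_rate(samples):
    """Per-second increase over a list of (unix_ts, value) pairs, oldest first.

    A drop in value is treated as a counter reset (process restart), in which
    case the new value is counted as the increase since the reset.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Three scrapes, 30 s apart, of a completed-requests counter for one l2_name.
print(counter_rate([(0, 100), (30, 400), (60, 700)]))  # 10.0
```

Choosing a longer or shorter list of samples corresponds to picking a different window in the dashboard query.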
Failure & Health Counters
Health-monitoring counters emitted on the dedicated lmcache_mp.health
OTel meter. Driven by the L1FailureMetricsSubscriber and
L2FailureMetricsSubscriber, which are registered automatically when
metrics are enabled. All three counters carry model_name (extracted
from each ObjectKey) so operators can slice per-model on the
Prometheus /metrics endpoint.
| Metric | Type | Description |
|---|---|---|
| | Counter | L1 memory allocation failures (OOM) during |
| | Counter | L1 |
| | Counter | Chunks that L2 reported present at lookup but failed to land in L1. Tagged by |
A reason=serde_failure value will be added to l2_prefetch_failure
as an additive, non-breaking extension once L2 adapters distinguish
deserialization errors from missing objects — no dashboard migration
needed when that lands.
For the full design rationale (including which event types drive each
counter and why lmcache_instance_id is deferred), see
docs/design/v1/mp_observability/METRICS.md in the source tree.
Lookup Hit-Rate Metrics
Token-level counters whose ratio gives the fraction of tokens requested by a lookup that were served from either L1 or L2. L0 (GPU prefix cache) is intentionally excluded — it is vLLM-owned and not observable from LMCache.
| Metric | Type | Description |
|---|---|---|
| | Counter (attrs: | Total tokens submitted for lookup (denominator of the L1+L2 token-level hit rate). Only chunk-aligned tokens are counted. |
| | Counter (attrs: | Total tokens found in L1 or L2 during lookup (numerator of the L1+L2 token-level hit rate). Counts the contiguous prefix hit only. |
Both counters are driven by the same event (MP_LOOKUP_PREFETCH_END),
so they always advance together per completed lookup. Early-exit lookups
contribute 0 to both, and abandoned lookups contribute to neither.
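The two counting rules (chunk-aligned tokens for the denominator, contiguous prefix hit for the numerator) can be illustrated with a small sketch. The function name and the chunk size of 256 tokens are assumptions made for the example, not values taken from LMCache.

```python
def lookup_token_counts(num_tokens, chunk_hits, chunk_size=256):
    """Return (requested_tokens, hit_tokens) for one completed lookup.

    chunk_hits[i] says whether chunk i was found in L1 or L2.
    chunk_size=256 is a hypothetical value for illustration.
    """
    num_chunks = num_tokens // chunk_size      # trailing partial chunk is ignored
    requested = num_chunks * chunk_size
    prefix_chunks = 0
    for hit in chunk_hits[:num_chunks]:
        if not hit:
            break                              # only the contiguous prefix counts
        prefix_chunks += 1
    return requested, prefix_chunks * chunk_size


# 1000 tokens -> 3 full chunks; chunks 0 and 1 hit, chunk 2 misses.
print(lookup_token_counts(1000, [True, True, False, True]))  # (768, 512)
```

Under these assumptions the lookup contributes 768 to the denominator and 512 to the numerator, so a hit after the first miss never inflates the ratio.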
The model_name and cache_salt attributes are captured at lookup
time from IPCCacheServerKey so dashboards can compute per-model or
per-tenant hit rate. cache_salt can be high-cardinality (one entry
per tenant or isolation domain); drop it at scrape time with
metric_relabel_configs if storage cost matters.
PromQL for hit rate:
# Aggregate (all models, all salts); sum() collapses the per-label series first:
sum(rate(lmcache_mp_lookup_hit_tokens_total[5m]))
  / sum(rate(lmcache_mp_lookup_requested_tokens_total[5m]))
# Per-model:
sum(rate(lmcache_mp_lookup_hit_tokens_total[5m])) by (model_name)
  / sum(rate(lmcache_mp_lookup_requested_tokens_total[5m])) by (model_name)
L0 (GPU) Block Lifecycle Histograms
Sampled (default 1%) GPU KV cache block lifecycle tracking via
L0LifecycleSubscriber. Eviction is detected at reallocation time
(when a block is assigned different tokens). Sampling uses random
selection with a _skipped set (bounded by the finite number of
physical GPU blocks).
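A rough sketch of random sampling with a remembered skip set, under the assumption that the decision is made once per physical block id and then reused. The class name `BlockSampler` is hypothetical; only the `_skipped` set is named in the text above.

```python
import random


class BlockSampler:
    """Random per-block sampling; decisions are remembered per block id.

    Both sets are bounded by the number of physical GPU blocks, since block
    ids are drawn from a fixed pool.
    """

    def __init__(self, sample_rate=0.01, rng=None):
        self._rate = sample_rate
        self._rng = rng or random.Random()
        self._tracked = set()
        self._skipped = set()

    def should_track(self, block_id):
        if block_id in self._tracked:
            return True
        if block_id in self._skipped:
            return False
        if self._rng.random() < self._rate:
            self._tracked.add(block_id)
            return True
        self._skipped.add(block_id)
        return False


sampler = BlockSampler(sample_rate=0.5, rng=random.Random(0))
first = [sampler.should_track(b) for b in range(8)]
second = [sampler.should_track(b) for b in range(8)]
assert first == second  # repeated accesses see a stable decision
```

Unlike the hash-based L1 gate, this approach needs state, which is acceptable here because the id space is small and fixed.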
All L0 histograms are emitted with instance_id and model_name
OTel attributes, enabling per-instance and per-model metric slicing
in Prometheus (e.g.
lmcache_mp_l0_block_lifetime_seconds{instance_id="12345",model_name="llama-7b"}).
| Metric | Type | Description |
|---|---|---|
| | Histogram | Time from allocation to eviction per sampled GPU block. |
| | Histogram | Time from last access to eviction per sampled GPU block. |
| | Histogram | Time gaps between consecutive accesses of the same GPU block. |
L0 ↔ L1 Throughput Histograms
Per-request throughput of GPU↔CPU copies via
L0L1ThroughputSubscriber. Every store/retrieve request contributes
one sample to the appropriate histogram:
total_bytes / (end_ts - start_ts) in GB/s. Timestamps come from
MP_{STORE,RETRIEVE}_{START,END} events published on the GPU cupy
stream, so they reflect true GPU-stream copy time — not Python/lock
overhead.
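The per-request sample is a single division. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and timestamps in seconds; the source does not say which GB definition is used, and the function name is hypothetical.

```python
def throughput_gb_per_s(total_bytes, start_ts, end_ts):
    """total_bytes / (end_ts - start_ts), expressed in GB/s.

    Assumes 1 GB = 1e9 bytes and timestamps in seconds. Returns None when
    the interval is not positive, so no sample would be recorded.
    """
    elapsed = end_ts - start_ts
    if elapsed <= 0:
        return None
    return total_bytes / elapsed / 1e9


# A 2 GB store whose START and END events are 0.5 s apart.
print(throughput_gb_per_s(2_000_000_000, 10.0, 10.5))  # 4.0
```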
All throughput histograms are emitted with engine_id (vLLM worker
instance id), device (e.g. "cuda:3"), and model_name OTel
attributes, enabling per-worker, per-device, and per-model slicing in
Prometheus (e.g.
lmcache_mp_l0_l1_store_throughput_GB_per_second{engine_id="0",device="cuda:3",model_name="meta-llama/Llama-3.1-8B"}).
| Metric | Type | Description |
|---|---|---|
| | Histogram | GPU→CPU (L0→L1) store throughput in GB/s per request. |
| | Histogram | CPU→GPU (L1→L0) load throughput in GB/s per request. |
L1 ↔ L2 Throughput Histograms
Per-request throughput of L1↔L2 transfers via
L2ThroughputSubscriber. The store path correlates
L2_STORE_SUBMITTED → L2_STORE_COMPLETED by
(adapter_index, task_id). The load path correlates the per-adapter
L2_LOAD_TASK_SUBMITTED → L2_LOAD_TASK_COMPLETED events by
(request_id, adapter_index); the request-level
L2_PREFETCH_LOAD_* events used by the chunk-count counters aggregate
across adapters and cannot be attributed to a specific l2_name.
Timestamps span submit → complete, so the duration includes adapter queue, network, and disk I/O — the value is bytes / end-to-end latency, not raw transfer rate. Use these histograms to compare adapter types and catch regressions; use the L0↔L1 histograms when you need pure copy-time throughput.
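The submit/complete correlation for the store path can be sketched as a dictionary keyed by (adapter_index, task_id). Everything below is illustrative: the class, method names, and event fields are hypothetical, the histogram is a plain list, and 1 GB is assumed to be 10^9 bytes.

```python
class L2StoreThroughput:
    """Pairs a submitted event with its completed event by (adapter_index, task_id)."""

    def __init__(self):
        self._pending = {}   # (adapter_index, task_id) -> (total_bytes, submit_ts)
        self.samples = []    # stand-in for the histogram: (l2_name, GB/s)

    def on_submitted(self, adapter_index, task_id, total_bytes, ts):
        self._pending[(adapter_index, task_id)] = (total_bytes, ts)

    def on_completed(self, adapter_index, task_id, l2_name, ts):
        entry = self._pending.pop((adapter_index, task_id), None)
        if entry is None:
            return  # completion without a matching submit: nothing to record
        total_bytes, submit_ts = entry
        elapsed = ts - submit_ts  # includes queueing, network, and disk time
        if elapsed > 0:
            self.samples.append((l2_name, total_bytes / elapsed / 1e9))


sub = L2StoreThroughput()
sub.on_submitted(adapter_index=0, task_id=7, total_bytes=1_000_000_000, ts=1.0)
sub.on_completed(adapter_index=0, task_id=7, l2_name="fs", ts=3.0)
print(sub.samples)  # [('fs', 0.5)]
```

The load path would follow the same pattern with (request_id, adapter_index) as the key.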
All L1↔L2 throughput histograms carry a single l2_name OTel
attribute — the registered adapter type (e.g. "fs", "nixl_store",
"mooncake_store") — enabling per-backend slicing in Prometheus (e.g.
lmcache_mp_l2_store_throughput_GB_per_second{l2_name="nixl_store"}).
| Metric | Type | Description |
|---|---|---|
| | Histogram | L1→L2 store throughput in GB/s per request. |
| | Histogram | L2→L1 load throughput in GB/s per (request, adapter) pair. |
Engine Counters
Worker-scoped counters tied to what the MP server delivers back to each
vLLM worker via retrieve(). Labeled by worker_id (the vLLM
worker instance id) — distinct from any scheduler-scoped id that may
appear on other metrics.
| Metric | Type | Description |
|---|---|---|
| | Counter (attrs: | Total number of LMCache chunks loaded into the engine, summed over all |
Observable Gauges
Point-in-time state snapshots registered via register_gauge
(pull-based OTel observable gauges).
The three in-flight metrics carry two attributes that distinguish
adapters even when more than one is registered with the same backend
type — same shape as lmcache_mp.l2_store_completed:
- l2_name: the registered adapter type (e.g. "fs", "nixl_store", "mooncake_store").
- adapter_index: position in the controller’s adapter list.
Adapters with no in-flight work emit no datapoint for that scrape.
| Metric | Type | Description |
|---|---|---|
| | ObservableGauge | Number of prefetch jobs currently in-flight. A sustained high value may indicate slow L2 backends or polling delays. |
| | ObservableGauge | Bytes currently held in L1. Rising without plateauing typically indicates a leak; saturating at the configured |
| | ObservableGauge | L1 used/total ratio ( |
| | ObservableGauge (attr: | Bytes currently held in each L2 adapter, sampled at scrape time from |
| | ObservableGauge (attrs: | L2 store tasks currently executing, per adapter. Sustained non-zero values indicate the adapter cannot keep up with the L1 → L2 write rate. |
| | ObservableGauge (attrs: | L2 → L1 prefetch load tasks currently executing, per adapter. Pair with |
| | ObservableGauge (attrs: | L1 bytes reserved by in-flight L2 → L1 prefetch loads, per adapter. Rising in-flight bytes alongside rising |
EventBus Self-Monitoring
Health metrics for the EventBus itself, registered by
EventBusSelfMetricsSubscriber on the lmcache.event_bus OTel
meter. These metrics observe bus state directly via the EventBus
accessors and report on every OTel scrape — they are not driven by
events, so dropping or failing subscribers cannot silence them.
Use them to answer: is the EventBus keeping up with publishers, is
anything being dropped, and are any subscriber callbacks raising?
A non-zero dropped_events_total or a sustained non-zero
drain_lag_seconds indicates the bus is at --event-bus-queue-size
and tail-dropping; raise that flag or investigate slow subscribers.
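To show how queue depth, drain lag, and dropped events relate, here is a toy bounded queue with tail-drop. It is not the EventBus implementation; the class name, the accessor names, and the choice to report a lag of 0 for an empty queue are assumptions made for the sketch.

```python
import time
from collections import deque


class BoundedEventQueue:
    """Toy bounded queue exposing the three quantities discussed above."""

    def __init__(self, max_size, clock=time.monotonic):
        self._max_size = max_size      # plays the role of --event-bus-queue-size
        self._clock = clock
        self._queue = deque()          # (publish_ts, event), oldest on the left
        self.dropped_events_total = 0

    def publish(self, event):
        if len(self._queue) >= self._max_size:
            self.dropped_events_total += 1   # tail-drop: the new event is discarded
            return False
        self._queue.append((self._clock(), event))
        return True

    def drain_one(self):
        return self._queue.popleft()[1] if self._queue else None

    def queue_depth(self):
        return len(self._queue)

    def drain_lag_seconds(self):
        # Age of the oldest queued event; 0.0 when empty (assumption).
        return self._clock() - self._queue[0][0] if self._queue else 0.0


now = [0.0]
bus = BoundedEventQueue(max_size=2, clock=lambda: now[0])
bus.publish("a")
now[0] = 1.0
bus.publish("b")
bus.publish("c")   # queue is full, so this one is dropped
now[0] = 5.0
print(bus.queue_depth(), bus.drain_lag_seconds(), bus.dropped_events_total)  # 2 5.0 1
```

Because these values are read from the queue itself at scrape time, they stay available even if every subscriber is failing.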
| Metric | Type | Description |
|---|---|---|
| | ObservableGauge | Events currently queued in the EventBus ( |
| | ObservableGauge | Seconds since the oldest queued event was published; |
| | ObservableCounter | Cumulative events dropped because the EventBus queue was at |
| | ObservableCounter (attr: | Cumulative exceptions raised by subscriber callbacks during EventBus dispatch, tagged by |
For the full design rationale and the in-process accessors that back
each metric, see docs/design/v1/mp_observability/METRICS.md and
docs/design/v1/mp_observability/event-bus.md in the source tree.
Prometheus Scrape Configuration
Add the LMCache server as a Prometheus scrape target:
scrape_configs:
  - job_name: "lmcache-mp"
    static_configs:
      - targets: ["<lmcache-host>:9090"]
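Before wiring up Prometheus, it can help to read the endpoint directly. The sketch below fetches /metrics with the standard library and sums one counter across all of its label sets. The helper names are hypothetical, the parser handles only plain `name{labels} value` lines, and the host in the usage comment is a placeholder.

```python
import urllib.request


def fetch_metrics(url):
    """Fetch the Prometheus text exposition from a /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def sum_series(exposition_text, metric_name):
    """Sum the values of every series named metric_name, across label sets."""
    total = 0.0
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            value_part = line.rsplit("}", 1)[1]
        else:
            name, _, value_part = line.partition(" ")
        if name == metric_name:
            total += float(value_part.split()[0])
    return total


# Usage against a running server (placeholder host):
# text = fetch_metrics("http://<lmcache-host>:9090/metrics")
# print(sum_series(text, "lmcache_mp_l1_read_chunks_total"))
```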