Tracing#
Note
--enable-tracing requires --otlp-endpoint to be set.
The server will refuse to start if tracing is enabled without an
OTLP endpoint, since there is no local fallback for trace export.
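The startup rule in the note can be pictured as a small argument check. The sketch below is illustrative only: it assumes an `argparse`-based parser and a hypothetical helper name (`parse_tracing_args`), and it is not taken from the LMCache source. Only the two flag names come from this page.

```python
import argparse


def parse_tracing_args(argv):
    """Parse the two tracing flags and reject an incomplete combination."""
    # Hypothetical parser; the real CLI may be built differently.
    parser = argparse.ArgumentParser(prog="lmcache server", add_help=False)
    parser.add_argument("--enable-tracing", action="store_true")
    parser.add_argument("--otlp-endpoint", default=None)
    args, _unknown = parser.parse_known_args(argv)
    if args.enable_tracing and not args.otlp_endpoint:
        # There is no local fallback for trace export, so fail fast.
        parser.error("--enable-tracing requires --otlp-endpoint")
    return args
```

`parser.error` prints a usage message and exits with status 2, which is one way a process could refuse to start.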
When tracing is enabled (--enable-tracing --otlp-endpoint <URL>),
the tracing subscriber creates OTel spans from START/END event pairs:
- mp.store: from MP_STORE_START to MP_STORE_END
- mp.retrieve: from MP_RETRIEVE_START to MP_RETRIEVE_END
- mp.lookup_prefetch: from MP_LOOKUP_PREFETCH_START to MP_LOOKUP_PREFETCH_END
Each span carries event metadata as span attributes (e.g. device,
stored_count, found_count).
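To make the START/END pairing concrete, here is a minimal pure-Python sketch of the idea. It does not use the OpenTelemetry SDK and is not the actual subscriber: the `Span` and `SpanPairer` classes, the `session_id` key, and the rule that END metadata is merged over START metadata are all assumptions made for illustration. Only the event and span names come from the list above.

```python
from dataclasses import dataclass, field

# Event prefixes and span names from this page; the rest is illustrative.
SPAN_NAMES = {
    "MP_STORE": "mp.store",
    "MP_RETRIEVE": "mp.retrieve",
    "MP_LOOKUP_PREFETCH": "mp.lookup_prefetch",
}


@dataclass
class Span:
    name: str
    start_ns: int
    end_ns: int
    attributes: dict = field(default_factory=dict)


class SpanPairer:
    """Pairs START/END events that share a (prefix, session id) key."""

    def __init__(self):
        self._open = {}  # (prefix, session_id) -> (start_ns, start metadata)

    def on_event(self, event_type, session_id, ts_ns, metadata=None):
        # "MP_STORE_START" -> prefix "MP_STORE", phase "START"
        prefix, _, phase = event_type.rpartition("_")
        if prefix not in SPAN_NAMES:
            return None
        key = (prefix, session_id)
        if phase == "START":
            self._open[key] = (ts_ns, dict(metadata or {}))
            return None
        if phase == "END" and key in self._open:
            start_ns, attrs = self._open.pop(key)
            attrs.update(metadata or {})  # event metadata becomes attributes
            return Span(SPAN_NAMES[prefix], start_ns, ts_ns, attrs)
        return None  # END without a matching START is dropped in this sketch
```

A real subscriber would hand the completed span to an OTLP exporter instead of returning it.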
View traces in any OTel-compatible backend such as Jaeger or Grafana Tempo.
# Start Jaeger all-in-one (OTLP gRPC on 4317)
docker run -d --name jaeger \
-p 16686:16686 -p 4317:4317 \
jaegertracing/all-in-one:latest
# Start LMCache with tracing
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--enable-tracing --otlp-endpoint http://localhost:4317
Per-Request Hit-Rate Attributes#
Each session is wrapped in a per-request root span — request for the
standard MP path and cb.request for the CacheBlend path — that nests
all child spans (mp.store, mp.retrieve, mp.lookup_prefetch)
beneath it. When the lookup phase ends, the root span is annotated with
three OTel attributes that summarise the request-level cache hit rate:
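| Attribute | OTel type | Description |
|---|---|---|
| hit_tokens | | Tokens served from L1+L2 (numerator). |
| requested_tokens | | Chunk-aligned tokens submitted for lookup (denominator). |
| hit_rate | | Request-level cache hit rate: hit_tokens / requested_tokens. |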
The attributes are written when MP_LOOKUP_PREFETCH_END (standard MP
path) or CB_LOOKUP_END (CacheBlend path) is processed — while the
root span is still open. Store-only requests that never call
lookup_prefetch_start() emit no end event for the lookup phase, so
their root span will not carry these attributes.
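As a rough illustration of how the three values relate, the sketch below builds the attribute dictionary for one lookup. The function name is hypothetical, the attribute names are taken from the TraceQL queries on this page, and the handling of a zero denominator is an assumption of this sketch rather than documented behaviour.

```python
def hit_rate_attributes(hit_tokens, requested_tokens):
    """Return the three root-span attributes for one lookup."""
    # Assumption: report 0.0 when nothing was submitted for lookup.
    rate = hit_tokens / requested_tokens if requested_tokens > 0 else 0.0
    return {
        "hit_tokens": hit_tokens,
        "requested_tokens": requested_tokens,
        "hit_rate": rate,
    }
```

With the OpenTelemetry Python API, a dictionary like this could be applied to an open span with `span.set_attributes(...)`.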
Example TraceQL queries (Grafana Tempo):
# Requests with less than 50% cache hit rate
{ name = "request" && span.hit_rate < 0.5 }
# Full cache hits only
{ name = "request" && span.hit_rate = 1.0 }
# Complete misses (lookup ran but nothing was cached)
{ name = "request" && span.requested_tokens > 0 && span.hit_tokens = 0 }
For the full event-to-span mapping and the registry pattern that links
child spans back to the root, see
docs/design/observability/request-event-span.md in the source tree.
Trace Recording#
Note
Trace recording is distinct from --enable-tracing (OTel
spans). Trace recording captures every StorageManager public-API
call to a binary file so the same workload can be replayed later
for testing, regression hunting, and benchmarking — without needing
vLLM and (eventually) without a GPU. --enable-tracing exports
live OTel spans to an OTLP endpoint for online observability.
The two features are independent and can be used together.
When --trace-level storage is set, LMCache records every call to
StorageManager.{reserve_write, finish_write, submit_prefetch_task,
read_prefetched_results, finish_read_prefetched} to a binary file
for later replay.
Recording is off by default and adds near-zero overhead when off
(a single boolean check per StorageManager call). When on,
recording happens on the EventBus drain thread, off the request path.
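The overhead claim rests on two ideas: a cheap boolean guard on the call path, and deferring the actual write to a background thread. The sketch below shows that pattern in plain Python. The `Recorder` class and `recorded` decorator are hypothetical stand-ins, not LMCache's EventBus or its decorator, and an in-memory list stands in for the binary trace file.

```python
import functools
import queue
import threading


class Recorder:
    """Hypothetical recorder: cheap enabled check, writes on a drain thread."""

    def __init__(self, enabled):
        self.enabled = enabled
        self.records = []  # stands in for the binary trace file
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        if enabled:
            self._thread.start()

    def _drain(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: stop draining
                break
            self.records.append(item)

    def submit(self, qualname, kwargs):
        self._queue.put((qualname, kwargs))

    def stop(self):
        if self.enabled:
            self._queue.put(None)
            self._thread.join()


def recorded(recorder):
    """Decorator that records a call's keyword arguments when enabled."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if recorder.enabled:  # the single boolean check on the call path
                recorder.submit(fn.__qualname__, dict(kwargs))
            return fn(*args, **kwargs)
        return inner
    return wrap
```

When `enabled` is false, the decorated function pays only for the attribute check before running as usual.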
Capturing a trace#
With an explicit output path:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--trace-level storage --trace-output /tmp/run.lct
With an implicit timestamped output path under $TMPDIR:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--trace-level storage
# → INFO log: "trace recording enabled (level=storage); no
# --trace-output given, writing to
# /tmp/lmcache-trace-<pid>-<UTC>.lct"
The trace file is closed cleanly on shutdown (SIGTERM is handled by the EventBus stop path).
Replay#
Replaying a recorded trace, along with the full set of CLI flags for driving, monitoring, and exporting replay results, is covered on its own page: Tracing and Debugging.
What is captured (and what is not)#
Captured:
- The fully-qualified name of every decorated StorageManager call.
- Each call's input arguments (e.g. keys, layout_desc, mode, extra_count, external_request_id).
- Wall-clock and monotonic timestamps of each call.
- A header carrying a trace schema version, start times, and a SHA-256 digest of the active StorageManagerConfig so replay can detect mismatched configurations.
Not captured:
- KV tensor bytes. Replay exercises bookkeeping and controller logic; payloads at replay time are zeros.
- Calls inside the MPCacheServer, the message queue, or any GPU-copy code. These layers are out of scope for the storage trace level.
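The configuration digest mentioned under "Captured" can be illustrated with a short sketch. How StorageManagerConfig is actually serialised before hashing is not described here, so the canonical-JSON step, both function names, and the example field names are assumptions.

```python
import hashlib
import json


def config_digest(config):
    """SHA-256 hex digest of a config dict, independent of key order."""
    # Assumption: canonical JSON (sorted keys, compact separators) as the
    # byte representation; the real serialisation may differ.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_replay_config(header_digest, active_config):
    """Return True when the recorded digest matches the active config."""
    return header_digest == config_digest(active_config)


# Hypothetical field names, loosely mirroring the CLI flags on this page.
recorded_digest = config_digest({"l1_size_gb": 100, "eviction_policy": "LRU"})
same = check_replay_config(
    recorded_digest, {"eviction_policy": "LRU", "l1_size_gb": 100}
)
different = check_replay_config(
    recorded_digest, {"eviction_policy": "LRU", "l1_size_gb": 50}
)
```

Sorting keys before hashing means two configurations with the same values compare equal regardless of field order, while any changed value produces a different digest.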
File format#
A length-prefixed msgpack stream:
[4-byte big-endian length][msgpack Header]
[4-byte big-endian length][msgpack Record]
[4-byte big-endian length][msgpack Record]
...
The Header carries a magic prefix (LMCT), a format version,
the trace level (storage today), a trace schema version, start
timestamps, and the StorageManagerConfig digest. Each Record
carries a relative timestamp, a wall-clock timestamp, the
fully-qualified call site (qualname), and an argument dict.
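The framing described above can be read with the standard library alone. The sketch below splits a stream into raw frame payloads, where the first frame would be the Header and the rest Records. It assumes the 4-byte length is unsigned (`">I"`), which this page does not state, and it leaves payload decoding to the caller, for example `msgpack.unpackb` from the third-party `msgpack` package. The function names are hypothetical.

```python
import io
import struct


def iter_frames(stream):
    """Yield the raw payload of each [4-byte big-endian length][payload] frame."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of stream
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(">I", prefix)  # assumed unsigned
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated frame payload")
        yield payload


def write_frame(stream, payload):
    """Append one length-prefixed frame (used here only to build sample input)."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


# Round-trip with placeholder bytes instead of real msgpack payloads.
buf = io.BytesIO()
write_frame(buf, b"header")
write_frame(buf, b"record-1")
buf.seek(0)
frames = list(iter_frames(buf))
```

Raising on a short read makes a trace that was cut off mid-frame visible instead of silently yielding a partial record.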
The format is deliberately extensible: future trace levels
(mq, gpu) will share this layout and use the level header
field to discriminate. Additional captured ops add new qualname
strings without bumping the format version.
For the full design rationale see