Observability
LMCache multiprocess mode includes two complementary observability systems: Prometheus metrics for aggregate counters and telemetry events for per-request tracing.
Prometheus Metrics
Prometheus metrics are enabled by default and served on port 9090. Pass --disable-prometheus to turn them off.
All metrics use the lmcache_mp: prefix to distinguish them from the
single-process lmcache: namespace.
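As a quick sanity check, the exposition text can be filtered down to the multiprocess metrics by prefix. The sketch below is illustrative only: it assumes the server exposes the conventional Prometheus `/metrics` path in the text exposition format, and the helper names and the sample metric name are hypothetical placeholders, not actual LMCache identifiers.

```python
import urllib.request
from typing import List


def filter_metric_lines(exposition_text: str, prefix: str = "lmcache_mp:") -> List[str]:
    """Keep only sample lines whose metric name starts with the prefix.

    Comment lines (# HELP / # TYPE) start with '#', so they are excluded.
    """
    return [line for line in exposition_text.splitlines() if line.startswith(prefix)]


def fetch_metrics(host: str = "localhost", port: int = 9090) -> str:
    # Assumption: metrics are served at /metrics, the Prometheus convention.
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Offline demonstration on made-up text (the metric names are placeholders).
sample = """# HELP lmcache_mp:example_total Placeholder counter
# TYPE lmcache_mp:example_total counter
lmcache_mp:example_total 42.0
python_gc_collections_total{generation="0"} 7.0
"""
mp_lines = filter_metric_lines(sample)
print(mp_lines)
```

Against a running server, `filter_metric_lines(fetch_metrics("<lmcache-host>"))` would apply the same filter to live output, provided the path assumption holds.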
StorageManager Read Metrics
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of read (prefetch) requests received by the StorageManager. |
| | Counter | Number of keys successfully found in L1 during read. |
| | Counter | Number of keys not found in L1 during read. |
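The found and not-found counters can be combined into an L1 hit ratio over a time window. Because counters only ever increase, the ratio should be computed from the difference between two scrapes rather than from the raw totals. The following is a minimal sketch; the function name is hypothetical, and the arguments are raw counter values that the caller reads from the two key counters above.

```python
from typing import Optional


def l1_hit_ratio(
    hits_before: float,
    misses_before: float,
    hits_after: float,
    misses_after: float,
) -> Optional[float]:
    """Hit ratio over the interval between two scrapes of monotonic counters.

    Returns None when no keys were looked up in the interval.
    """
    hit_delta = hits_after - hits_before
    miss_delta = misses_after - misses_before
    total = hit_delta + miss_delta
    if total <= 0:
        return None
    return hit_delta / total


# Example: 90 new hits and 10 new misses between two scrapes.
ratio = l1_hit_ratio(100, 50, 190, 60)
print(ratio)
```

In a Prometheus or Grafana setup the same calculation would normally be expressed as a query over counter rates; the Python version only illustrates the arithmetic.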
StorageManager Write Metrics
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of write (reserve) requests. |
| | Counter | Number of keys successfully reserved for write. |
| | Counter | Number of keys that failed to reserve (OOM, write conflict). |
L1 Metrics
| Metric | Type | Description |
|---|---|---|
| | Counter | Number of keys successfully read from L1. |
| | Counter | Number of keys successfully written to L1. |
| | Counter | Number of keys evicted from L1 by the EvictionController. |
Note
L2 metrics are not yet finalized and will be added in a future release.
Grafana / Prometheus Scrape Configuration
Add the LMCache server as a Prometheus scrape target:
scrape_configs:
  - job_name: "lmcache-mp"
    static_configs:
      - targets: ["<lmcache-host>:9090"]
Configuration
| Argument | Default | Description |
|---|---|---|
| --disable-prometheus | | Disable Prometheus metrics. |
| | 9090 | Port for the |
| | | Flush interval (seconds) from internal stats to Prometheus counters. |
Telemetry Event System
The telemetry system produces structured START/END event pairs for each server operation (lookup, store, retrieve). It is disabled by default and must be explicitly enabled.
Enabling Telemetry
python3 -m lmcache.v1.multiprocess.server \
    --l1-size-gb 100 --eviction-policy LRU \
    --enable-telemetry \
    --telemetry-processor '{"type": "logging", "log_level": "DEBUG"}'
Event Model
Each telemetry event contains:
- name: Operation name (e.g., lookup, retrieve, store).
- event_type: START or END.
- session_id: Request ID for correlating start/end pairs.
- timestamp: High-resolution monotonic timestamp.
- metadata: Operation-specific data (e.g., found_count, retrieved_count, device).
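The fields above map naturally onto a small record type, and session_id makes it possible to pair each END with its START to obtain a per-operation duration. The sketch below is a hypothetical model for illustration: the `TelemetryEvent` class and `pair_durations` helper are not LMCache's actual types, only a shape consistent with the listed fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TelemetryEvent:
    # Hypothetical record mirroring the documented fields.
    name: str                 # e.g. "lookup", "retrieve", "store"
    event_type: str           # "START" or "END"
    session_id: str           # request ID used to correlate pairs
    timestamp: float          # monotonic clock reading, in seconds
    metadata: Dict[str, Any] = field(default_factory=dict)


def pair_durations(events: Iterable[TelemetryEvent]) -> List[Tuple[str, str, float]]:
    """Match each END to the START with the same (name, session_id)."""
    open_starts: Dict[Tuple[str, str], float] = {}
    durations: List[Tuple[str, str, float]] = []
    for ev in events:
        key = (ev.name, ev.session_id)
        if ev.event_type == "START":
            open_starts[key] = ev.timestamp
        elif ev.event_type == "END" and key in open_starts:
            elapsed = ev.timestamp - open_starts.pop(key)
            durations.append((ev.name, ev.session_id, elapsed))
    return durations


events = [
    TelemetryEvent("lookup", "START", "req-001", 10.0),
    TelemetryEvent("lookup", "END", "req-001", 10.5, {"found_count": 3}),
]
result = pair_durations(events)
print(result)
```

Because the timestamp is monotonic rather than wall-clock, differences between events are meaningful but absolute values should not be interpreted as dates.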
Processors
Telemetry events are dispatched to one or more processors configured via
--telemetry-processor <JSON>.
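To illustrate how a repeatable JSON flag of this kind is typically consumed, here is a minimal argparse sketch. It is an assumption about the general pattern, not LMCache's actual argument parser; only the two flag names are taken from this page.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser only; the real server defines its own arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-telemetry", action="store_true")
    # action="append" collects every occurrence of the flag;
    # type=json.loads decodes each value into a dict.
    parser.add_argument(
        "--telemetry-processor",
        action="append",
        default=[],
        type=json.loads,
    )
    return parser


args = build_parser().parse_args(
    [
        "--enable-telemetry",
        "--telemetry-processor", '{"type": "logging", "log_level": "DEBUG"}',
        "--telemetry-processor", '{"type": "logging", "log_level": "INFO"}',
    ]
)
specs = args.telemetry_processor
print(specs)
```

With this pattern, malformed JSON is rejected at startup with a usage error, since argparse treats the `ValueError` raised by `json.loads` as an invalid argument value.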
Built-in: logging processor
Logs each event via LMCache’s logger at the specified level.
--telemetry-processor '{"type": "logging", "log_level": "DEBUG"}'
Sample output:
LMCache DEBUG: Telemetry: lookup START session=req-001 ts=12345.678 metadata={}
LMCache DEBUG: Telemetry: lookup END session=req-001 ts=12345.680 metadata={'found_count': 3}
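Log lines in this shape can be parsed back into structured records, for example to compute operation latency from a log file. The sketch below assumes the line layout is exactly as in the sample output above and that the metadata field is a Python dict literal; both are inferences from the sample, not a documented format guarantee.

```python
import ast
import re
from typing import Any, Dict, Optional

# Pattern inferred from the sample output; adjust if the real format differs.
LINE_RE = re.compile(
    r"Telemetry: (?P<name>\S+) (?P<event_type>START|END) "
    r"session=(?P<session>\S+) ts=(?P<ts>[0-9.]+) metadata=(?P<metadata>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one telemetry log line, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "name": m["name"],
        "event_type": m["event_type"],
        "session_id": m["session"],
        "timestamp": float(m["ts"]),
        # literal_eval safely evaluates the dict literal without executing code.
        "metadata": ast.literal_eval(m["metadata"]),
    }


lines = [
    "LMCache DEBUG: Telemetry: lookup START session=req-001 ts=12345.678 metadata={}",
    "LMCache DEBUG: Telemetry: lookup END session=req-001 ts=12345.680 metadata={'found_count': 3}",
]
start, end = (parse_line(line) for line in lines)
elapsed = end["timestamp"] - start["timestamp"]
print(f"{end['name']} took about {elapsed * 1000:.1f} ms")
```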
Configuration
| Argument | Default | Description |
|---|---|---|
| --enable-telemetry | | Enable telemetry. |
| | | Maximum events in queue before tail-drop. |
| --telemetry-processor | (none) | Processor spec JSON (repeatable). |
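Tail-drop means that once the queue is full, newly arriving events are discarded while the events already queued are kept. The sketch below shows that policy in isolation; the `TailDropQueue` class is hypothetical and is not LMCache's implementation.

```python
from collections import deque
from typing import Any, Deque, List


class TailDropQueue:
    """Bounded FIFO that rejects new items when full (tail-drop)."""

    def __init__(self, max_size: int) -> None:
        self._items: Deque[Any] = deque()
        self._max_size = max_size
        self.dropped = 0  # number of events discarded so far

    def put(self, item: Any) -> bool:
        """Enqueue the item, or drop it and return False if the queue is full."""
        if len(self._items) >= self._max_size:
            self.dropped += 1
            return False
        self._items.append(item)
        return True

    def drain(self) -> List[Any]:
        """Remove and return all queued items in arrival order."""
        items = list(self._items)
        self._items.clear()
        return items


queue = TailDropQueue(max_size=2)
accepted = [queue.put(i) for i in range(4)]
print(accepted, queue.dropped)
```

The design trade-off is that the producer never blocks on telemetry: under a burst, the oldest queued events survive and the newest are lost, so a dropped-event counter is useful for noticing gaps.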
Logging
LMCache uses Python’s logging module. Control the log level with the
LMCACHE_LOG_LEVEL environment variable:
LMCACHE_LOG_LEVEL=DEBUG python3 -m lmcache.v1.multiprocess.server ...
Key log messages to look for:
| Level | Message |
|---|---|
| INFO | |
| INFO | |
| INFO | |
| DEBUG | |
| DEBUG | |
| DEBUG | |