Metrics by vLLM API#
LMCache provides detailed metrics via a Prometheus endpoint, allowing for in-depth monitoring of cache performance and behavior.
This section outlines how to enable and access these metrics through the /metrics
API endpoint embedded in vLLM.
Quick Start Guide#
1) On vLLM/LMCache side#
In v1, vLLM and LMCache run in separate processes, so you have to use Prometheus in multi-process mode.
The PROMETHEUS_MULTIPROC_DIR
environment variable must be set to the same path in both processes, since it serves as their IPC directory.
PROMETHEUS_MULTIPROC_DIR=/tmp/lmcache_prometheus \
# .. other environment variables \
vllm serve $MODEL --port 8000 ...
Once the HTTP server is running, you can access the LMCache metrics at the /metrics
endpoint.
curl http://<vllm-worker-ip>:8000/metrics | grep lmcache
# Replace <vllm-worker-ip> with the IP address of a vLLM worker
You will also find some .db
files in the $PROMETHEUS_MULTIPROC_DIR
directory.
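The role of this shared directory can be illustrated with a simplified sketch: each process writes metrics only to its own per-PID file, and the process serving /metrics aggregates across all files. Note this is a conceptual illustration only; the real prometheus_client multi-process mode uses memory-mapped .db files, not JSON.

```python
import json
import os
import tempfile

def record(metric_dir: str, pid: int, name: str, value: float) -> None:
    """Each process writes only to its own per-PID file (no cross-process locks)."""
    path = os.path.join(metric_dir, f"counter_{pid}.json")
    with open(path, "w") as f:
        json.dump({name: value}, f)

def aggregate(metric_dir: str, name: str) -> float:
    """The exporting process sums the counter across every process's file."""
    total = 0.0
    for fname in os.listdir(metric_dir):
        with open(os.path.join(metric_dir, fname)) as f:
            total += json.load(f).get(name, 0.0)
    return total

# Two processes (illustrative PIDs) share one directory, as vLLM and LMCache
# share PROMETHEUS_MULTIPROC_DIR.
metric_dir = tempfile.mkdtemp()
record(metric_dir, 101, "lmcache_requests", 3)
record(metric_dir, 102, "lmcache_requests", 5)
print(aggregate(metric_dir, "lmcache_requests"))  # 8.0
```

This is why both processes must see the same PROMETHEUS_MULTIPROC_DIR: the aggregating side can only report counters whose files land in the directory it scans.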
2) Prometheus Configuration#
To scrape the LMCache metrics with a Prometheus server, add the following job to your prometheus.yml
configuration, or an equivalent configuration that scrapes the metrics endpoint:
scrape_configs:
  - job_name: 'lmcache'
    static_configs:
      - targets: ['<vllm-worker-ip>:8000']
    scrape_interval: 15s
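To sanity-check a scrape outside Prometheus, the text exposition format returned by /metrics can be filtered and parsed with a few lines of Python. The metric names below are illustrative placeholders, not LMCache's actual metric names, and the parser handles only simple unlabeled samples:

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse unlabeled samples from the Prometheus text exposition format,
    skipping comment lines (# HELP / # TYPE)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Example scrape body with placeholder names.
body = """\
# HELP lmcache_example_requests Total number of retrieve requests
# TYPE lmcache_example_requests counter
lmcache_example_requests 42
lmcache_example_hit_rate 0.75
"""
metrics = {k: v for k, v in parse_prometheus_text(body).items()
           if k.startswith("lmcache")}
print(metrics)  # {'lmcache_example_requests': 42.0, 'lmcache_example_hit_rate': 0.75}
```

The `startswith("lmcache")` filter mirrors the `grep lmcache` in the curl example above.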
Available Metrics#
LMCache exposes a variety of metrics to monitor its performance. The following table lists all available metrics organized by category:
| Metric Name | Type | Description |
|---|---|---|
| **Core Request Metrics** | | |
| | Counter | Total number of retrieve requests |
| | Counter | Total number of store requests |
| | Counter | Total number of lookup requests |
| | Counter | Total number of tokens requested for retrieval |
| | Counter | Total number of cache hit tokens from retrieval |
| | Counter | Total number of tokens requested in lookup operations |
| | Counter | Total number of tokens hit in lookup operations |
| | Counter | Number of hit tokens in vLLM |
| **Hit Rate Metrics** | | |
| | Gauge | The hit rate for retrieve requests |
| | Gauge | The hit rate for lookup requests |
| **Cache Usage Metrics** | | |
| | Gauge | Local cache usage in bytes |
| | Gauge | Remote cache usage in bytes |
| | Gauge | Local storage usage in bytes |
| **Performance Metrics** | | |
| | Histogram | Time taken to retrieve from the cache (seconds) |
| | Histogram | Time taken to store to the cache (seconds) |
| | Histogram | Retrieval speed (tokens per second) |
| | Histogram | Storage speed (tokens per second) |
| **Remote Backend Metrics** | | |
| | Counter | Total number of read requests to remote backends |
| | Counter | Total number of bytes read from remote backends |
| | Counter | Total number of write requests to remote backends |
| | Counter | Total number of bytes written to remote backends |
| | Histogram | Time taken to get data from remote backends (milliseconds) |
| | Histogram | Time taken to put data to remote backends (milliseconds) |
| | Histogram | Time taken to get data from remote backends synchronously (milliseconds) |
| **Network Monitoring Metrics** | | |
| | Gauge | Latest ping latency to remote backends (milliseconds) |
| | Counter | Number of ping errors to remote backends |
| | Counter | Number of ping successes to remote backends |
| | Gauge | Latest ping error code to remote backends |
| **Local CPU Backend Metrics** | | |
| | Counter | Total number of evictions in the local CPU backend |
| | Counter | Total number of evicted keys in the local CPU backend |
| | Counter | Total number of failed evictions in the local CPU backend |
| | Gauge | The size of the hot cache |
| | Gauge | The size of the keys in request |
| **Memory Management Metrics** | | |
| | Gauge | The number of active memory objects |
| | Gauge | The number of pinned memory objects |
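The hit-rate gauges above relate the token counters to each other: hit tokens divided by requested tokens. A minimal sketch of that relationship (the function name and values are illustrative, not part of LMCache's API):

```python
def hit_rate(hit_tokens: float, requested_tokens: float) -> float:
    """Hit rate = hit tokens / requested tokens; 0.0 when nothing was requested."""
    return hit_tokens / requested_tokens if requested_tokens else 0.0

# Placeholder counter values as they might appear in a single scrape.
print(hit_rate(768, 1024))  # 0.75
```

The same ratio can be computed over time windows in Prometheus by dividing the rates of the two counters, which is often more useful for dashboards than the instantaneous gauge.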