# Health Monitor
LMCache includes a health monitoring framework that continuously checks the health of the cache engine and its components. This is essential in production deployments for detecting and responding to failures in remote storage backends.
## Overview
The Health Monitor provides:

- **Automatic health checks**: periodically monitors the health of all registered components
- **Extensible framework**: easily add custom health checks for new components
- **Remote backend monitoring**: built-in support for monitoring remote storage backends via ping
- **Degraded mode support**: automatically blocks operations when the system is unhealthy
- **Prometheus metrics integration**: health status is exposed via the metrics endpoint
## Architecture
The health monitoring system consists of three main components:
- **HealthCheck** (abstract base class): the base class for individual health checks. Each health check represents one aspect of system health.
- **HealthMonitor**: the central monitor that orchestrates all health checks. It runs in a background thread and periodically executes every registered health check.
- **RemoteBackendHealthCheck**: a built-in health check for remote storage backends. It pings the remote connector to verify connectivity.
## Auto-Discovery
The Health Monitor uses an auto-discovery mechanism to find and instantiate health checks:
1. At startup, the monitor scans the `lmcache.v1.health_monitor.checks` package.
2. All classes that inherit from `HealthCheck` are discovered.
3. Each check’s `create_from_engine()` method is called to create instances.
4. The instances are registered with the monitor.
This design allows you to add new health checks by simply creating a new module in the checks package.
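The discovery mechanism described above can be sketched with the standard library. The function below is an illustration only; the real implementation inside `lmcache.v1.health_monitor` may differ in details:

```python
import importlib
import inspect
import pkgutil


def discover_checks(package, base_cls, engine):
    """Scan every module in `package` for subclasses of `base_cls` and
    instantiate them via their create_from_engine() factory.

    Illustrative sketch of the auto-discovery flow described above,
    not the actual LMCache code.
    """
    checks = []
    for module_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package.__name__}.{module_info.name}")
        for _, cls in inspect.getmembers(module, inspect.isclass):
            if issubclass(cls, base_cls) and cls is not base_cls:
                instance = cls.create_from_engine(engine)
                if instance is not None:  # a check may opt out for this engine
                    checks.append(instance)
    return checks
```

Because modules only need to be dropped into the package to be picked up, no central registry has to be edited when a new check is added.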
## Configuration
Health monitor configuration is done through the `extra_config` section of your LMCache configuration:
| Configuration Key | Default Value | Description |
|---|---|---|
| `ping_interval` | | Interval (in seconds) between health check cycles |
| | | Timeout (in seconds) for each ping operation |
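A configuration sketch is shown below. The key names and values here are placeholders for illustration (this page does not spell out the exact keys); consult the LMCache configuration reference for the real ones:

```yaml
# Illustrative only: key names and values are placeholders,
# not verified LMCache configuration keys.
extra_config:
  ping_interval: 30   # seconds between health check cycles
  ping_timeout: 5     # seconds before a ping counts as timed out
```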
## How It Works
### Runtime Behavior
The health monitor runs in a background thread:
1. Every `ping_interval` seconds, all health checks are executed.
2. If any check fails, the system is marked as unhealthy.
3. When unhealthy, store/retrieve operations are blocked with a warning log.
4. Once all checks pass again, the system is marked as healthy and operations resume.
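The loop above can be sketched in a few lines of Python. The class and attribute names below are illustrative, not the actual LMCache internals:

```python
import threading


class MiniHealthMonitor:
    """Minimal sketch of the background health-check loop described above."""

    def __init__(self, checks, ping_interval=30.0):
        self.checks = checks            # callables returning True when healthy
        self.ping_interval = ping_interval
        self.healthy = True
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            # The system is healthy only if every registered check passes.
            self.healthy = all(check() for check in self.checks)
            self._stop.wait(self.ping_interval)
```

Running the checks in a daemon thread keeps health probing off the request path, so a slow remote backend cannot stall store or retrieve calls directly.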
### Graceful Degradation
When the health monitor detects an unhealthy state:
- **Store operations**: skipped with a warning message
- **Retrieve operations**: return empty results with a warning message
- **Lookup operations**: return 0 (no cache hits) with a warning message
This prevents cascading failures when remote backends are unavailable.
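The degradation behavior can be sketched as guard clauses around each operation. The function names and signatures below are illustrative stand-ins, not the real LMCache API:

```python
import logging

logger = logging.getLogger("lmcache.health")


def guarded_store(is_healthy, cache, key, value):
    """Skip the store with a warning when the system is unhealthy."""
    if not is_healthy():
        logger.warning("Health check failing; skipping store for %s", key)
        return
    cache[key] = value


def guarded_retrieve(is_healthy, cache, key):
    """Return an empty result with a warning when unhealthy."""
    if not is_healthy():
        logger.warning("Health check failing; returning empty retrieve")
        return None
    return cache.get(key)


def guarded_lookup(is_healthy, cache, keys):
    """Report zero cache hits with a warning when unhealthy."""
    if not is_healthy():
        logger.warning("Health check failing; reporting 0 cache hits")
        return 0
    return sum(1 for k in keys if k in cache)
```

Failing fast at the operation boundary means callers see a cache miss rather than a hung request, which is what prevents the cascading failures mentioned above.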
## Built-in Health Checks
### RemoteBackendHealthCheck
This check monitors the connectivity to remote storage backends (e.g., Redis, Valkey).
What it checks:

- Pings the remote connector to verify it is reachable
- Measures ping latency
- Reports error codes for failures
When it’s active:

- Only when a remote backend is configured (`remote_url` is set)
- Only if the connector supports the `ping()` operation
Metrics reported:
- `lmcache:remote_ping_latency`: latest ping latency (milliseconds)
- `lmcache:remote_ping_error_code`: latest error code (0 = success)
- `lmcache:remote_ping_errors`: total number of ping errors
- `lmcache:remote_ping_successes`: total number of successful pings
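A ping with a timeout and latency measurement, mirroring the error-code convention documented on this page, might be sketched as follows. The connector API itself is not shown on this page, so `ping_fn` is a stand-in for the connector’s `ping()` operation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def timed_ping(ping_fn, timeout_s):
    """Run ping_fn with a timeout; return (error_code, latency_ms).

    Error codes follow the convention documented on this page:
    0 = success, -1 = timeout, -2 = generic error. This is an
    illustrative sketch, not the real RemoteBackendHealthCheck.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ping_fn)
        try:
            future.result(timeout=timeout_s)
            code = 0
        except FutureTimeout:
            code = -1
        except Exception:
            code = -2
    return code, (time.monotonic() - start) * 1000.0
```

The measured latency and the error code are exactly the two values a check like this would push into the `lmcache:remote_ping_latency` and `lmcache:remote_ping_error_code` gauges.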
## Prometheus Metrics
The health monitor exposes metrics through the Prometheus endpoint:
| Metric Name | Type | Description |
|---|---|---|
| | Gauge | Overall system health status (1 = healthy, 0 = unhealthy) |
| `lmcache:remote_ping_latency` | Gauge | Latest ping latency to remote backends (milliseconds) |
| `lmcache:remote_ping_error_code` | Gauge | Latest ping error code (0 = success, -1 = timeout, -2 = generic error) |
| `lmcache:remote_ping_errors` | Counter | Total number of ping errors to remote backends |
| `lmcache:remote_ping_successes` | Counter | Total number of successful pings to remote backends |
## Error Codes

The health check system uses the following error codes:

| Code | Description |
|---|---|
| `0` | Success: the health check passed |
| `-1` | Timeout: the ping operation exceeded the configured timeout |
| `-2` | Generic error: an unexpected error occurred during the health check |
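In application code these codes map naturally onto a small integer enum. The enum and helper below are illustrative; the names are assumptions, not LMCache identifiers:

```python
from enum import IntEnum


class PingErrorCode(IntEnum):
    """The error codes documented above; enum names are illustrative."""
    SUCCESS = 0         # the health check passed
    TIMEOUT = -1        # the ping exceeded the configured timeout
    GENERIC_ERROR = -2  # an unexpected error occurred


def is_healthy_code(code: int) -> bool:
    """A ping is considered successful only when the code is 0."""
    return code == PingErrorCode.SUCCESS
```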
## Extending the Health Monitor
You can add custom health checks by creating a new module in the `lmcache/v1/health_monitor/checks/` directory. The custom check will be automatically discovered and registered when LMCache starts.
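As a sketch, a custom check might look like the following. The `HealthCheck` interface shown here (a `check()` method and a `create_from_engine()` factory) is inferred from this page and written as a local stand-in; the hypothetical `LocalDiskHealthCheck` and the engine’s `local_cache_dir` attribute are invented for illustration:

```python
import shutil
from abc import ABC, abstractmethod


class HealthCheck(ABC):
    """Stand-in for the base class in lmcache.v1.health_monitor
    (assumed interface, inferred from this document)."""

    @classmethod
    @abstractmethod
    def create_from_engine(cls, engine):
        """Return an instance, or None if the check does not apply."""

    @abstractmethod
    def check(self) -> bool:
        """Return True when this aspect of the system is healthy."""


class LocalDiskHealthCheck(HealthCheck):
    """Hypothetical check: verify free disk space for a local cache dir."""

    def __init__(self, path: str, min_free_bytes: int):
        self.path = path
        self.min_free_bytes = min_free_bytes

    @classmethod
    def create_from_engine(cls, engine):
        # Only activate when the (hypothetical) engine has a local cache dir.
        path = getattr(engine, "local_cache_dir", None)
        if path is None:
            return None
        return cls(path, min_free_bytes=1 << 30)  # require 1 GiB free

    def check(self) -> bool:
        return shutil.disk_usage(self.path).free >= self.min_free_bytes
```

Returning `None` from `create_from_engine()` lets a check opt out for deployments where it does not apply, which matches how the built-in remote backend check activates only when `remote_url` is set.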