Health Monitor#

LMCache includes a comprehensive health monitoring framework that continuously monitors the health of the cache engine and its components. This feature is essential for production deployments to detect and respond to failures in remote storage backends.

Overview#

The Health Monitor provides:

  • Automatic health checks: Periodically monitors the health of all registered components

  • Extensible framework: Easily add custom health checks for new components

  • Remote backend monitoring: Built-in support for monitoring remote storage backends via ping

  • Degraded mode support: Automatically blocks operations when the system is unhealthy

  • Prometheus metrics integration: Health status exposed via metrics endpoint

Architecture#

The health monitoring system consists of three main components:

  1. HealthCheck (Abstract Base Class)

    Base class for individual health checks. Each health check represents one aspect of system health.

  2. HealthMonitor

    The central monitor that orchestrates all health checks. It runs in a background thread and periodically executes all registered health checks.

  3. RemoteBackendHealthCheck

    Built-in health check for remote storage backends. It pings the remote connector to verify connectivity.

Auto-Discovery#

The Health Monitor uses an auto-discovery mechanism to find and instantiate health checks:

  1. At startup, the monitor scans the lmcache.v1.health_monitor.checks package

  2. All classes that inherit from HealthCheck are discovered

  3. Each check’s create_from_engine() method is called to create instances

  4. The instances are registered with the monitor

This design allows you to add new health checks by simply creating a new module in the checks package.

Configuration#

Health monitor configuration is done through the extra_config section of your LMCache configuration:

Health Monitor Configuration Options#

Configuration Key

Default Value

Description

ping_interval

30.0

Interval (in seconds) between health check cycles

ping_timeout

5.0

Timeout (in seconds) for each ping operation

How It Works#

Runtime Behavior#

The health monitor runs in a background thread:

  1. Every ping_interval seconds, all health checks are executed

  2. If any check fails, the system is marked as unhealthy

  3. When unhealthy, store/retrieve operations are blocked with a warning log

  4. Once all checks pass again, the system is marked as healthy and operations resume

Graceful Degradation#

When the health monitor detects an unhealthy state:

  • Store operations: Skipped with a warning message

  • Retrieve operations: Return empty results with a warning message

  • Lookup operations: Return 0 (no cache hits) with a warning message

This prevents cascading failures when remote backends are unavailable.

Built-in Health Checks#

RemoteBackendHealthCheck#

This check monitors the connectivity to remote storage backends (e.g., Redis, Valkey).

What it checks:

  • Pings the remote connector to verify it’s reachable

  • Measures ping latency

  • Reports error codes for failures

When it’s active:

  • Only when a remote backend is configured (remote_url is set)

  • Only if the connector supports the ping() operation

Metrics reported:

  • lmcache:remote_ping_latency: Latest ping latency (milliseconds)

  • lmcache:remote_ping_error_code: Latest error code (0 = success)

  • lmcache:remote_ping_errors: Total number of ping errors

  • lmcache:remote_ping_successes: Total number of successful pings

Prometheus Metrics#

The health monitor exposes metrics through the Prometheus endpoint:

Health Monitor Metrics#

Metric Name

Type

Description

lmcache:is_healthy

Gauge

Overall system health status (1 = healthy, 0 = unhealthy)

lmcache:remote_ping_latency

Gauge

Latest ping latency to remote backends (milliseconds)

lmcache:remote_ping_error_code

Gauge

Latest ping error code (0 = success, -1 = timeout, -2 = generic error)

lmcache:remote_ping_errors

Counter

Total number of ping errors to remote backends

lmcache:remote_ping_successes

Counter

Total number of successful pings to remote backends

Error Codes#

The health check system uses the following error codes:

Health Check Error Codes#

Code

Description

0

Success - the health check passed

-1

Timeout - the ping operation exceeded the configured timeout

-2

Generic error - an unexpected error occurred during the health check

Extending the Health Monitor#

You can add custom health checks by creating a new module in the lmcache/v1/health_monitor/checks/ directory.

The custom check will be automatically discovered and registered when LMCache starts.