Multi-Server Coordination
When you run more than one LMCache multiprocess (MP) server, the MP Coordinator gives you a single, fleet-wide view of every running server. It is a standalone service that the MP servers register with. Each MP server still caches independently; the coordinator ties them together into one fleet.
Running the coordinator
The coordinator is a FastAPI service. Start it with:
lmcache coordinator
Expected log output:
LMCache INFO: MP coordinator listening on http://0.0.0.0:9300
The CLI accepts --host, --port, --instance-timeout,
--health-check-interval, --eviction-check-interval,
--eviction-ratio, --trigger-watermark, --blend-chunk-size, and
--blend-probe-stride; any flag overrides the matching environment variable
below. See the lmcache coordinator command reference for details.
Equivalently, the coordinator can still be launched as a module with
python3 -m lmcache.v1.mp_coordinator.
Configuration
The coordinator is configured through LMCACHE_MP_COORDINATOR_* environment
variables:
| Environment variable | Default | Description |
|---|---|---|
|  |  | Host the HTTP server binds to. |
|  |  | Port the HTTP server binds to. |
|  |  | Seconds without a heartbeat after which a server is dropped from the fleet. |
|  |  | Seconds between health-check sweeps. |
|  |  | Seconds between L2 eviction sweeps. |
|  |  | Fraction of tracked keys (by count) to evict per cycle (0.0 to 1.0). |
|  |  | Eviction fires when usage reaches this fraction of the quota (0.0 exclusive to 1.0). |
|  |  | Tokens per chunk for the global CacheBlend directory. Must equal the LMCache chunk size the blend servers use. |
|  |  | Positions between CacheBlend match probes. |
|  |  | When |
|  |  | Seconds between registry checks while waiting for the first MP server to register so startup resync can begin. |
|  |  | Maximum seconds startup resync waits for an MP server before giving up. The coordinator keeps running with empty trackers until normal usage events fill them in. |
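As an illustration of how the flags and environment variables fit together, here is a minimal sketch that builds the `lmcache coordinator` command line and its environment from Python. The helper names (`coordinator_command`, `coordinator_env`, `launch`) are hypothetical and not part of LMCache. The default host and port are taken from the log line shown earlier, and the example value in the usage note is arbitrary.

```python
import os
import subprocess


def coordinator_command(host="0.0.0.0", port=9300):
    """Build the argv for `lmcache coordinator` with explicit bind flags."""
    return ["lmcache", "coordinator", "--host", host, "--port", str(port)]


def coordinator_env(overrides, base=None):
    """Return a copy of the environment with coordinator settings applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        # Only variables in the documented namespace are accepted here.
        if not name.startswith("LMCACHE_MP_COORDINATOR_"):
            raise ValueError(f"unexpected variable: {name}")
        env[name] = str(value)
    return env


def launch(host="0.0.0.0", port=9300, overrides=None):
    """Start the coordinator as a child process (requires lmcache installed)."""
    return subprocess.Popen(
        coordinator_command(host, port),
        env=coordinator_env(overrides or {}),
    )
```

For example, `launch(overrides={"LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL": 30})` would start the coordinator with a 30-second eviction sweep, assuming the variable accepts a plain number of seconds. Because flags override environment variables, the `--host` and `--port` values passed here win over any corresponding variables already set.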
Connecting MP servers
An MP server (lmcache server) joins the coordinator when you point it at one
with --coordinator-url. It registers on startup, heartbeats while running,
and deregisters on shutdown – all on the server’s own event loop. This is
opt-in: with no URL set, the server runs exactly as before. Each flag falls back
to a matching LMCACHE_COORDINATOR_* environment variable (handy for the
Kubernetes downward API); an explicit flag wins over the env var.
| Flag (on the MP server) | Env fallback | Description |
|---|---|---|
|  |  | Coordinator base URL, e.g. |
|  |  | IP the coordinator should reach this server at (defaults to the server’s outbound IP). |
|  |  | Seconds between heartbeats (must be |
|  |  | Enable reporting L2 store/lookup events to the coordinator for fleet-wide usage tracking and quota-based eviction. |
|  |  | Seconds between L2 event batch flushes (must be |
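The "explicit flag wins over the env var" rule can be modelled in a few lines. This is a sketch of the precedence only, not the MP server's actual argument parsing. The environment variable name `LMCACHE_COORDINATOR_URL` is an assumption derived from the `LMCACHE_COORDINATOR_*` pattern, and `resolve_coordinator_url` is a hypothetical helper.

```python
import argparse
import os

# Assumed env name: the docs only say each flag has a matching
# LMCACHE_COORDINATOR_* variable, so this exact spelling is a guess.
URL_ENV = "LMCACHE_COORDINATOR_URL"


def resolve_coordinator_url(argv, environ=None):
    """Return the coordinator URL, or None when coordination is disabled."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    # The env var supplies the default; a flag on the command line overrides it.
    parser.add_argument("--coordinator-url", default=environ.get(URL_ENV))
    args, _unknown = parser.parse_known_args(argv)
    return args.coordinator_url or None
```

Using the environment value as the argparse default keeps the precedence in one place: with neither source set the result is `None`, which matches the opt-in behaviour described above.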
The server registers under its stable identity (--instance-id / OTel
service.instance.id); if the flag is not passed, the server mints a
random UUID v4 at startup and registers under that.
Registration is best-effort: if the coordinator is unreachable, the MP server logs a warning, keeps retrying, and continues serving. A malformed heartbeat-interval value is rejected at startup.
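To make the register, heartbeat, and timeout cycle concrete, the following toy model tracks the last heartbeat per instance and drops any instance that has been silent longer than the instance timeout. It is a sketch of the idea under simple assumptions (registration and heartbeat both just refresh a timestamp), not the coordinator's implementation. `FleetRegistry` and its methods are hypothetical names.

```python
import time


class FleetRegistry:
    """Toy model of heartbeat-based membership; not the coordinator's code."""

    def __init__(self, instance_timeout, clock=time.monotonic):
        self.instance_timeout = instance_timeout
        self._clock = clock
        self._last_seen = {}  # instance_id -> time of last heartbeat

    def heartbeat(self, instance_id):
        # Registration and heartbeat are treated alike: both refresh the timestamp.
        self._last_seen[instance_id] = self._clock()

    def deregister(self, instance_id):
        self._last_seen.pop(instance_id, None)

    def sweep(self):
        """Drop instances that have been silent longer than the timeout."""
        now = self._clock()
        expired = [
            iid for iid, seen in self._last_seen.items()
            if now - seen > self.instance_timeout
        ]
        for iid in expired:
            del self._last_seen[iid]
        return expired

    def instances(self):
        return sorted(self._last_seen)
```

In this model a health-check sweep is simply a periodic call to `sweep()`. The clock is injectable so the behaviour can be exercised without waiting in real time.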
Inspecting the fleet
Two read-only endpoints let you observe the coordinator:
- GET /instances: list every registered MP server.
- GET /healthz: coordinator liveness probe (for Kubernetes).
curl -s http://localhost:9300/instances
# -> {"instances": [{"instance_id": "...", "ip": "10.0.0.5", "http_port": 8080, ...}]}
curl -s http://localhost:9300/healthz
# -> {"status": "healthy"}
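The same two endpoints can be queried from Python with the standard library. This is a minimal sketch: the helper names are hypothetical, and the only response fields it relies on (`instances`, `ip`, `http_port`, `status`) are the ones visible in the sample output above.

```python
import json
import urllib.request


def get_json(base_url, path, timeout=5.0):
    """GET a coordinator endpoint and decode the JSON body."""
    with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.load(resp)


def instance_addresses(payload):
    """Turn a /instances response into sorted 'ip:http_port' strings."""
    return sorted(
        f"{item['ip']}:{item['http_port']}" for item in payload.get("instances", [])
    )


def is_healthy(payload):
    """True when a /healthz response reports status 'healthy'."""
    return payload.get("status") == "healthy"
```

With a coordinator running locally, `instance_addresses(get_json("http://localhost:9300", "/instances"))` would list the address of each registered server.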
L2 usage tracking and eviction
When MP servers enable --coordinator-l2-event-reporting, they stream L2
store, lookup, and delete events to the coordinator. The coordinator
aggregates per-cache_salt usage, enforces quotas, and selects LRU keys
to evict.
Each event batch carries the server’s instance_id and a monotonically
increasing sequence number (seq) scoped to that instance. These fields
enable future gap detection to identify lost batches.
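Since gap detection is described as future work, the following is only one way it could be done with the `instance_id` and `seq` fields: remember the highest sequence number seen per instance and report any numbers skipped. `SeqTracker` is a hypothetical class, and treating a lower-or-equal `seq` as a duplicate is an assumption.

```python
class SeqTracker:
    """Hypothetical per-instance gap detector for event batch sequence numbers."""

    def __init__(self):
        self._last = {}  # instance_id -> highest seq seen so far

    def observe(self, instance_id, seq):
        """Record a batch; return the list of seq numbers skipped before it."""
        last = self._last.get(instance_id)
        if last is None:
            # First batch from this instance: nothing to compare against.
            self._last[instance_id] = seq
            return []
        if seq <= last:
            # Duplicate or late batch; not a gap.
            return []
        missing = list(range(last + 1, seq))
        self._last[instance_id] = seq
        return missing
```

Because `seq` is scoped to one instance, each instance gets its own counter, and a gap on one server says nothing about the others.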
Active eviction loop. Every
LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL seconds, the
coordinator inspects per-salt usage against the registered quotas and,
for any salt over the trigger watermark, picks LRU victims and
dispatches a single DELETE /l2 to a uniformly random registered MP
server. Because all MP servers share the same backing L2 (e.g. one S3
bucket), one dispatch evicts the keys for the whole fleet. The MP
server’s L2 adapter fires on_l2_keys_deleted listeners after the
delete completes; those listeners ship delete events back through
POST /l2/events, which is what updates the coordinator’s LRU state and
per-salt totals. If a dispatch fails or no instances are registered, the
coordinator tries again on the next cycle. This gives at-least-once
semantics, which is safe because the S3 delete is idempotent.
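The decision made on each sweep can be sketched as a pure function: check each salt against the trigger watermark, take a fraction of its least recently used keys, and choose one registered server at random to perform the delete. This is a simplified model, not the coordinator's code. The function names and the data layout are hypothetical, and rounding the victim count up is an assumption.

```python
import math
import random


def pick_victims(lru, usage_bytes, quota_bytes, trigger_watermark, eviction_ratio):
    """Return the least recently used keys to evict for one salt, or []."""
    if usage_bytes < trigger_watermark * quota_bytes:
        return []  # below the watermark: nothing to do this cycle
    # Assumption: round up so that a triggered cycle evicts at least one key.
    count = min(len(lru), math.ceil(eviction_ratio * len(lru)))
    # The mapping is assumed to iterate oldest-first (for example an
    # OrderedDict whose keys are moved to the end on every access).
    return list(lru)[:count]


def eviction_cycle(salts, instances, trigger_watermark, eviction_ratio, rng=random):
    """Plan one sweep: return (chosen instance, keys to delete) or None."""
    if not instances:
        return None  # no registered MP server; try again next cycle
    victims = []
    for salt in salts.values():
        victims += pick_victims(
            salt["lru"], salt["usage"], salt["quota"],
            trigger_watermark, eviction_ratio,
        )
    if not victims:
        return None
    # One dispatch is enough because every server shares the same L2 backend.
    return rng.choice(instances), victims
```

Note that the plan does not update usage itself. In the flow described above, totals only change when the delete events come back through POST /l2/events, so a failed dispatch leaves the state untouched and the same keys are selected again on the next cycle.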
Startup resync. On boot, the coordinator waits up to
LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT seconds for the first MP
server to register, then pages through that server’s
GET /l2/keys endpoint and seeds the in-memory usage and eviction trackers
with whatever is already resident in L2, so a fresh coordinator
does not start from zero usage. Set
LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC=False to skip this
phase. Resync is best-effort: failures are logged and the manager gives
up, and the ongoing usage-event stream from MP servers eventually corrects
any initial blind spots.
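The two phases of startup resync, waiting for a first server and then walking its key listing, might look roughly like this. It is a sketch under stated assumptions: the response format of GET /l2/keys is not documented here, so the cursor-style paging and the `cache_salt` and `bytes` fields on each key are hypothetical, as are both function names.

```python
import time


def wait_for_first_instance(list_instances, poll_interval, max_wait,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll until an instance is registered; return it, or None on timeout."""
    deadline = clock() + max_wait
    while True:
        instances = list_instances()
        if instances:
            return instances[0]
        if clock() >= deadline:
            return None  # give up; trackers stay empty until events arrive
        sleep(poll_interval)


def seed_from_pages(fetch_page):
    """Sum bytes per cache_salt over a paginated key listing.

    fetch_page(cursor) is assumed to return (keys, next_cursor), with
    next_cursor set to None on the last page.
    """
    usage = {}
    cursor = None
    while True:
        keys, cursor = fetch_page(cursor)
        for key in keys:
            usage[key["cache_salt"]] = usage.get(key["cache_salt"], 0) + key["bytes"]
        if cursor is None:
            return usage
```

Returning `None` on timeout rather than raising mirrors the best-effort behaviour: the caller can log the failure and carry on with empty trackers.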
Quota management – set per-cache_salt byte budgets. Salts without a
quota default to a 0-byte limit (allowlist semantics).
# Set a 10 GiB quota for tenant "user-a"
curl -s -X PUT http://localhost:9300/l2/quota/user-a \
-H 'Content-Type: application/json' \
-d '{"limit_gb": 10.0}'
# -> {"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}
# Remove the quota
curl -s -X DELETE http://localhost:9300/l2/quota/user-a
# -> {"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}
Use _default as the path parameter to target the empty-string salt.
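For scripted quota management, the curl calls above translate to the standard library as follows. This is a minimal sketch with hypothetical helper names. The `_default` mapping for the empty salt follows the note above, while percent-encoding other salts is an assumption about how the coordinator decodes the path.

```python
import json
import urllib.parse
import urllib.request


def salt_path(cache_salt):
    """Map a cache_salt to its URL path segment ('_default' for the empty salt)."""
    return "_default" if cache_salt == "" else urllib.parse.quote(cache_salt, safe="")


def quota_request(base_url, cache_salt, limit_gb=None):
    """Build a PUT (set) or DELETE (remove) request for a salt's quota."""
    url = f"{base_url.rstrip('/')}/l2/quota/{salt_path(cache_salt)}"
    if limit_gb is None:
        return urllib.request.Request(url, method="DELETE")
    body = json.dumps({"limit_gb": float(limit_gb)}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=5.0):
    """Send a prepared request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes the URL and body easy to inspect before anything touches a live coordinator, for example `send(quota_request("http://localhost:9300", "user-a", 10))`.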
Event ingestion – MP servers POST batched events; this is handled
automatically by the event listener and is not typically called manually.
Supported event types are store, lookup, and delete. A
delete event subtracts the key’s previously-recorded bytes from the
per-salt totals (the wire bytes field is ignored for delete;
the coordinator already knows the size from the original store).
curl -s -X POST http://localhost:9300/l2/events \
-H 'Content-Type: application/json' \
-d '{
"instance_id": "server-1",
"seq": 1,
"events": [
{"type": "store", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 1024},
{"type": "lookup", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0},
{"type": "delete", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0}
]
}'
# -> {"recorded": 3}
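The accounting rules for the three event types can be illustrated with a small in-memory model: a store adds bytes, a lookup only refreshes recency, and a delete subtracts the size recorded at store time while ignoring the wire `bytes` field. This is a sketch, not the coordinator's tracker. `UsageTracker` is a hypothetical class, and replacing the size when a key is stored twice is an assumption.

```python
from collections import OrderedDict


class UsageTracker:
    """Toy model of per-salt byte accounting driven by L2 events."""

    def __init__(self):
        self.sizes = OrderedDict()  # key tuple -> bytes, oldest access first
        self.totals = {}            # cache_salt -> total bytes

    @staticmethod
    def _key(key):
        return (key["cache_salt"], key["model_name"], key["kv_rank"],
                key["chunk_hash_hex"])

    def apply(self, event):
        k, salt = self._key(event["key"]), event["key"]["cache_salt"]
        if event["type"] == "store":
            # Assumption: re-storing a key replaces its recorded size.
            old = self.sizes.pop(k, 0)
            self.sizes[k] = event["bytes"]
            self.totals[salt] = self.totals.get(salt, 0) - old + event["bytes"]
        elif event["type"] == "lookup":
            if k in self.sizes:
                self.sizes.move_to_end(k)  # mark as most recently used
        elif event["type"] == "delete":
            # The wire "bytes" field is ignored; use the size recorded at store.
            self.totals[salt] = self.totals.get(salt, 0) - self.sizes.pop(k, 0)
        else:
            raise ValueError(f"unknown event type: {event['type']}")

    def apply_batch(self, batch):
        for event in batch["events"]:
            self.apply(event)
        return {"recorded": len(batch["events"])}
```

Feeding the three-event batch from the curl example through `apply_batch` leaves `user-a` at zero bytes, since the delete removes exactly the 1024 bytes recorded by the store.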
Status queries – inspect usage and quota info.
# Single salt
curl -s http://localhost:9300/l2/status/user-a
# -> {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
# All salts
curl -s http://localhost:9300/l2/status
# -> {"total_gb": 0.005, "by_cache_salt": [...]}
L2 endpoint summary
| Method | Path | Description |
|---|---|---|
|  |  | Create or update a quota (body: |
|  |  | Remove a salt’s quota entry. |
|  |  | Ingest a batch of L2 |
|  |  | Quota and usage for a single salt. |
|  |  | Total usage and per-salt breakdown. |