多服务器协调#

当您运行多个 LMCache 多进程(MP)服务器时,MP 协调器是一个独立服务,各服务器向其注册,从而为您提供覆盖所有运行中服务器的单一全局视图。每个 MP 服务器独立缓存;协调器将它们整合为一个统一管理的集群。

运行协调器#

协调器是一个 FastAPI 服务。使用以下命令启动它:

lmcache coordinator

预期的日志输出:

LMCache INFO: MP coordinator listening on http://0.0.0.0:9300

The CLI accepts --host, --port, --instance-timeout, --health-check-interval, --eviction-check-interval, --eviction-ratio, --trigger-watermark, --blend-chunk-size, --blend-probe-stride, and --timeout-keep-alive; any flag overrides the matching environment variable below. See lmcache coordinator for details. Equivalently, the coordinator can still be launched as a module with python3 -m lmcache.v1.mp_coordinator.

配置#

协调器通过 LMCACHE_MP_COORDINATOR_* 环境变量进行配置:

环境变量

默认

描述

LMCACHE_MP_COORDINATOR_HOST

0.0.0.0

HTTP服务器绑定的主机。

LMCACHE_MP_COORDINATOR_PORT

9300

HTTP服务器绑定的端口。

LMCACHE_MP_COORDINATOR_INSTANCE_TIMEOUT

30

超过此秒数未收到心跳后,服务器将从集群中移除。

LMCACHE_MP_COORDINATOR_HEALTH_CHECK_INTERVAL

10

健康检查扫描之间的秒数。0 禁用逐出。

LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL

5

L2 逐出清扫之间的秒数。0 禁用循环。

LMCACHE_MP_COORDINATOR_EVICTION_RATIO

0.2

每个周期逐出的跟踪键的比例(按计数,范围为 0.0 到 1.0)。

LMCACHE_MP_COORDINATOR_TRIGGER_WATERMARK

1.0

当使用量达到配额的该比例时触发逐出(取值范围为 (0.0, 1.0],不含 0.0)。

LMCACHE_MP_COORDINATOR_BLEND_CHUNK_SIZE

256

全局 CacheBlend 目录每块的令牌数。必须等于混合服务器使用的 LMCache 块大小。

LMCACHE_MP_COORDINATOR_BLEND_PROBE_STRIDE

1

在 CacheBlend 匹配探针之间的位置。1 在每个偏移量进行探测以实现完全回忆。

LMCACHE_MP_COORDINATOR_TIMEOUT_KEEP_ALIVE

10

Seconds the HTTP server keeps idle connections open before closing them. Must be greater than the MP servers' heartbeat interval (default 5), otherwise heartbeat requests may hit a closing connection and fail with Server disconnected without sending a response.

LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC

True

When True, the coordinator runs a one-shot L2 resync on startup that paginates an MP server's GET /cache/objects and backfills usage + eviction trackers from existing L2 contents. Disable to start from empty trackers (handy for tests, or deployments that start the coordinator before any MP server).

LMCACHE_MP_COORDINATOR_RESYNC_POLL_INTERVAL

1

在等待第一个 MP 服务器注册以便启动重新同步可以开始时,注册检查之间的秒数。

LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT

60

最大秒数启动重新同步等待 MP 服务器在放弃之前。协调器保持运行,使用空的跟踪器,直到正常使用事件填充它们。

LMCACHE_MP_COORDINATOR_RESYNC_PAGE_SIZE

1000

page_size forwarded to the MP server's GET /cache/objects during resync. Larger values reduce RTT count; the server clamps to its own ceiling.

连接 MP 服务器#

当您通过 --coordinator-url 为 MP 服务器(lmcache server)指定协调器地址时,该服务器便会加入协调器。它在启动时注册、运行时持续发送心跳、关闭时注销——所有操作均在服务器自身的事件循环中完成。此功能为可选项:若未设置 URL,服务器的行为与之前完全相同。每个标志均可回退到对应的 LMCACHE_COORDINATOR_* 环境变量(在使用 Kubernetes Downward API 时非常便捷);显式标志优先于环境变量。

标志(在 MP 服务器上)

环境回退

描述

--coordinator-url

LMCACHE_COORDINATOR_URL

协调器基础 URL,例如 http://coordinator:9300。设置后启用注册。

--coordinator-advertise-ip

LMCACHE_COORDINATOR_ADVERTISE_IP

协调器应通过此服务器访问的 IP(默认为服务器的外部 IP)。

--coordinator-heartbeat-interval

LMCACHE_COORDINATOR_HEARTBEAT_INTERVAL

心跳之间的秒数(必须为 > 0,默认值为 5)。保持远低于协调器的 INSTANCE_TIMEOUT

--coordinator-l2-event-reporting

LMCACHE_COORDINATOR_L2_EVENT_REPORTING

启用向协调器上报 L2 存储/查找事件,用于全集群使用量跟踪及基于配额的逐出。

--coordinator-l2-event-flush-interval

LMCACHE_COORDINATOR_L2_EVENT_FLUSH_INTERVAL

L2 事件批量刷新之间的秒数(必须为 > 0,默认值为 1)。

服务器在其稳定身份下注册(--instance-id / OTel service.instance.id);如果未传递该标志,服务器将在启动时生成一个随机的 UUID v4 并在该 UUID 下注册。

注册是尽力而为的:如果协调器无法访问,MP 服务器会记录警告,持续重试,并继续提供服务。启动时会拒绝格式错误的心跳间隔值。

HTTP endpoints#

The coordinator's HTTP surface (base URL http://localhost:9300) groups into:

  • Fleet membership and health -- registration and liveness (/instances, /healthz).

  • Quota, usage, and eviction -- the /quota group: per-tenant byte budgets, usage accounting, and the usage-event ingest that drives fleet-wide eviction.

  • Cache control -- the /cache group: cache operations dispatched to a named server (currently warm prefetch, with more to come).

Each endpoint is documented below. Success is 200 unless noted, and {cache_salt} uses the _default sentinel for the empty salt. The wire types live in lmcache/v1/mp_coordinator/schemas.py.

Fleet membership and health#

MP servers register, heartbeat, and deregister automatically (see Connecting MP servers); GET /instances and GET /healthz are read-only operator views.

POST /instances#

Register (or re-register) an MP server. Called automatically by each server on startup.

Request body:

Field

Type

描述

ip

string

IP/host of the server's HTTP API; the coordinator dials this address, so it must be non-empty.

http_port

int

Port of the server's HTTP API.

instance_id

string

Optional. Server identifier; if omitted (or blank) the coordinator generates one and returns it.

metadata

object

Optional. Free-form string -> string registration hints.

p2p_advertised_url

string

Optional. URL the server advertises for peer-to-peer transfers; empty when it is not in P2P.

mq_port

int

Optional (default 0). ZMQ message-queue port P2P peers send lookup/unlock RPCs to; 0 when P2P is disabled.

Response (200 OK):

{"instance_id": "server-1", "re_registered": false}

instance_id is the registered id (the generated one when the request omitted it); re_registered is true when this replaced an existing registration.

HTTP status codes:

  • 200: registered.

  • 422: request body fails field-level validation (e.g. blank ip or out-of-range http_port).

Example:

curl -s -X POST http://localhost:9300/instances \
    -H 'Content-Type: application/json' \
    -d '{"ip": "10.0.0.5", "http_port": 8080}'
# -> {"instance_id": "mp-3f2c9d...", "re_registered": false}

PUT /instances/{instance_id}/heartbeat#

Record a liveness heartbeat. Called automatically while the server runs.

Path parameters: instance_id — the instance recording the heartbeat.

Response (200 OK):

{"instance_id": "server-1"}

HTTP status codes:

  • 200: heartbeat recorded.

  • 404: unknown instance — the caller should re-register via POST /instances.

Example:

curl -s -X PUT http://localhost:9300/instances/server-1/heartbeat
# -> {"instance_id": "server-1"}

DELETE /instances/{instance_id}#

Deregister an MP server. Called automatically on shutdown.

Path parameters: instance_id — the server to deregister.

Response: 204 No Content with an empty body, returned whether or not the instance was registered (idempotent).

HTTP status codes:

  • 204: deregistered (also returned for an unknown instance).

Example:

curl -s -X DELETE http://localhost:9300/instances/server-1 -o /dev/null -w '%{http_code}\n'
# -> 204

GET /instances#

List every registered MP server.

Response (200 OK):

{
  "instances": [
    {
      "instance_id": "server-1",
      "ip": "10.0.0.5",
      "http_port": 8080,
      "registration_time": 1719000000.0,
      "metadata": {},
      "p2p_advertised_url": "",
      "mq_port": 0
    }
  ]
}

Each entry reports the server's instance_id, the ip / http_port the coordinator reaches it at, the wall-clock registration_time (epoch seconds), any metadata supplied at registration, and the p2p_advertised_url / mq_port used for peer-to-peer transfers (empty / 0 when P2P is disabled).

HTTP status codes:

  • 200: fleet listed (an empty fleet returns {"instances": []}).

Example:

curl -s http://localhost:9300/instances

GET /healthz#

Coordinator liveness probe (for Kubernetes).

Response (200 OK):

{"status": "healthy"}

HTTP status codes:

  • 200: the coordinator is up.

Example:

curl -s http://localhost:9300/healthz
# -> {"status": "healthy"}

Quota, usage, and eviction#

The /quota group owns per-cache_salt byte budgets, the live usage accounting behind them, and the usage-event stream that drives fleet-wide eviction. (The MP server exposes a node-local /quota with the same shape; this is its fleet-wide counterpart.) Salts without a quota default to a 0-byte limit (allowlist semantics); use _default as the path parameter to target the empty-string salt.

When MP servers enable --coordinator-l2-event-reporting, they stream L2 store, lookup, and delete events to the coordinator, which aggregates per-cache_salt usage, enforces quotas, and selects LRU keys to evict. Each batch carries the server's instance_id and a monotonically increasing sequence number (seq) scoped to that instance, enabling future gap detection.

Active eviction loop. Every LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL seconds, the coordinator inspects per-salt usage against the registered quotas and, for any salt over the trigger watermark, picks LRU victims and dispatches a single DELETE /cache/objects to a uniformly random registered MP server. Because all MP servers share the same backing L2 (e.g. one S3 bucket), one dispatch evicts the keys for the whole fleet. The MP server's L2 adapter fires on_l2_keys_deleted listeners after the delete completes; those listeners ship delete events back through POST /quota/events, which is what updates the coordinator's LRU + per-salt totals. Dispatch failures or no-instances-registered fall through to the next cycle — at-least-once semantics, safe because the S3 delete is idempotent.

Startup resync. On boot, the coordinator waits up to LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT seconds for the first MP server to register, then paginates its GET /cache/objects and seeds the in-memory usage + eviction trackers with whatever is already resident in L2 — so a fresh coordinator does not start from zero usage. Set LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC=False to skip this phase. Best-effort: resync failures are logged and the manager gives up; the ongoing usage-event stream from MP servers eventually corrects any initial blind spots.

PUT /quota/{cache_salt}#

Create or update a tenant's byte budget.

Path parameters: cache_salt — tenant identifier (_default for the empty salt).

Request body:

Field

Type

描述

limit_gb

float

Byte budget in GiB; must be >= 0 (0 clears the tenant's data on the next eviction cycle).

tier

string

Optional (default l2). Cache tier the quota applies to; only l2 is supported today.

Response (200 OK):

{"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}

HTTP status codes:

  • 200: quota applied.

  • 400: invalid limit (negative or non-finite).

  • 422: request body fails field-level validation.

Example:

curl -s -X PUT http://localhost:9300/quota/user-a \
    -H 'Content-Type: application/json' \
    -d '{"limit_gb": 10.0}'
# -> {"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}

DELETE /quota/{cache_salt}#

Remove a salt's quota entry. Any bytes still cached under it become over-budget on the next eviction cycle (effective limit drops to 0).

Path parameters: cache_salt — tenant identifier (_default for the empty salt).

Query parameters: tier — optional (default l2); cache tier the quota applies to.

Response (200 OK):

{"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}

When no quota was registered for the salt, status is "not_found" (still 200 OK).

HTTP status codes:

  • 200: removed, or not_found if no quota existed.

Example:

curl -s -X DELETE http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}

GET /quota/{cache_salt}#

Read the quota and live usage for a single salt.

Path parameters: cache_salt — tenant identifier (_default for the empty salt).

Query parameters: tier — optional (default l2).

Response (200 OK):

{"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}

quota_limit_gb is the configured limit in GiB (0.0 when no quota is set), quota_exists whether an explicit quota is registered, and usage_gb the current aggregate usage. This endpoint never returns 404 for an unknown salt.

HTTP status codes:

  • 200: quota and usage reported.

Example:

curl -s http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}

GET /quota#

List total usage and a per-salt breakdown.

Query parameters: tier — optional (default l2).

Response (200 OK):

{
  "total_gb": 0.005,
  "by_cache_salt": [
    {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
  ]
}

total_gb is aggregate usage across all salts in GiB; each by_cache_salt entry has the same fields as the GET /quota/{cache_salt} response.

HTTP status codes:

  • 200: usage reported.

Example:

curl -s http://localhost:9300/quota
# -> {"total_gb": 0.005, "by_cache_salt": [...]}

POST /quota/events#

Ingest a batch of usage events. Sent automatically by reporting MP servers; not usually called by hand.

Request body:

Field

Type

描述

instance_id

string

The MP server that produced this batch.

seq

int

Monotonic per-instance sequence number (>= 1); supports future gap detection of lost batches.

tier

string

Optional (default l2). Cache tier the events apply to.

events

list[object]

The events to record. Each is {"type", "key", "bytes"}: type is "store", "lookup", or "delete"; key is the encoded object key; bytes (>= 0) is the stored size — counted for store and ignored for lookup / delete (a delete subtracts the size recorded at the original store).

Response (200 OK):

{"recorded": 3}

recorded is the number of events processed.

HTTP status codes:

  • 200: events processed.

  • 422: request body fails field-level validation.

Example:

curl -s -X POST http://localhost:9300/quota/events \
    -H 'Content-Type: application/json' \
    -d '{
        "instance_id": "server-1",
        "seq": 1,
        "events": [
            {"type": "store",  "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 1024},
            {"type": "lookup", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0},
            {"type": "delete", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0}
        ]
    }'
# -> {"recorded": 3}

Cache control#

The /cache group dispatches cache operations to a named MP server. Today it covers warm prefetch; further cache-control operations will be documented as endpoints here as they land.

Warm prefetch (pre-loading L1 from L2). Pre-warm one MP server's L1 with the KV for a known prompt before the requests arrive, so the first request hits L1 instead of paying the L2 fetch inline -- useful when you know a workload is about to be routed to a node (a traffic shift, a hot shared system prompt).

You describe the content by token ids -- the unit the cache speaks -- never by internal cache keys, which you cannot construct (a key is a content hash plus a per-rank layout bitmap). The coordinator forwards the request to the named server, which hashes the tokens, expands them across the node's ranks, loads the chunks from L2 into L1, and retains them so a later lookup hits. The submit returns a request_id; poll the status endpoint until completed. The warm acquires no lock -- the poll simply reports progress and clears the server-side job once the load finishes.

POST /cache/prefetches#

Submit a warm prefetch of a token sequence on one named server.

Request body:

Field

Type

描述

instance_id

string

Target MP server; must be registered.

model_name

string

Model whose layout sizes the target's L1 buffers.

world_size

int

World size (>= 1) selecting the KV layout and the per-rank fan-out (1 for a single-GPU, TP=1 deployment).

token_ids

list[int]

Prompt tokens whose complete chunk_size chunks are warmed; must match what was stored (same tokenizer / special tokens). A sub-chunk sequence is a noop.

cache_salt

string

Optional (default ""). Per-tenant isolation salt applied to the produced keys.

Response (200 OK):

{"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}

When the sequence is shorter than one chunk, nothing is submitted and request_id is empty:

{"instance_id": "server-1", "request_id": "", "chunks": 0, "status": "noop"}

request_id is the id to poll; chunks is the number of whole chunks submitted to warm.

HTTP status codes:

  • 200: submitted (or a noop as above).

  • 404: unknown instance_id (not registered).

  • 502: the target server was unreachable or rejected the submit.

  • 422: request body fails field-level validation.

备注

Single-node scope: one instance_id warms only that node's shards. For a model sharded across multiple nodes, submit one request per node's instance.

Example:

curl -s -X POST http://localhost:9300/cache/prefetches \
    -H 'Content-Type: application/json' \
    -d '{
        "instance_id": "server-1",
        "model_name": "Qwen/Qwen3-8B",
        "world_size": 1,
        "token_ids": [101, 102, 103, "..."],
        "cache_salt": "user-a"
    }'
# -> {"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}

GET /cache/prefetches/{instance_id}/{request_id}#

Poll a submitted warm prefetch; the response relays the owning server's status verbatim with its code.

Path parameters:

Field

Type

描述

instance_id

string

The server the prefetch was submitted to.

request_id

string

The id returned by POST /cache/prefetches.

Response (200 OK) while the load runs:

{"status": "pending"}

…and once complete:

{"status": "completed", "found_keys": 12, "total_keys": 12}

found_keys of total_keys requested chunks were resident.

HTTP status codes:

  • 200: status reported (pending or completed).

  • 404: unknown instance_id, or unknown request_id relayed from the server.

  • 502: the target server was unreachable.

Example:

curl -s http://localhost:9300/cache/prefetches/server-1/abc123
# -> {"status": "completed", "found_keys": 12, "total_keys": 12}