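Multi-server coordination#
When you run multiple LMCache multiprocess (MP) servers, the MP coordinator is a standalone service that each server registers with, giving you a single global view of every running server. Each MP server caches independently; the coordinator ties them together into one centrally managed fleet.
Running the coordinator#
The coordinator is a FastAPI service. Start it with the following command: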
lmcache coordinator
Expected log output:
LMCache INFO: MP coordinator listening on http://0.0.0.0:9300
The CLI accepts --host, --port, --instance-timeout,
--health-check-interval, --eviction-check-interval,
--eviction-ratio, --trigger-watermark, --blend-chunk-size,
--blend-probe-stride, and --timeout-keep-alive; any flag overrides the
matching environment variable below. See lmcache coordinator for details.
Equivalently, the coordinator can still be launched as a module with
python3 -m lmcache.v1.mp_coordinator.
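If the coordinator is started from a Python supervisor script rather than a shell, the command line can be assembled as in the sketch below. The flag names are taken from the list above; the helper names `build_coordinator_cmd` and `launch_coordinator`, and the example values, are assumptions made for illustration only.

```python
import subprocess


def build_coordinator_cmd(host="0.0.0.0", port=9300, instance_timeout=None):
    """Build an argv list for `lmcache coordinator` (hypothetical helper)."""
    cmd = ["lmcache", "coordinator", "--host", host, "--port", str(port)]
    if instance_timeout is not None:
        # Pass the flag only when set, so the matching environment
        # variable (or the built-in default) still applies otherwise.
        cmd += ["--instance-timeout", str(instance_timeout)]
    return cmd


def launch_coordinator(**kwargs):
    """Start the coordinator as a child process and return the Popen handle."""
    return subprocess.Popen(build_coordinator_cmd(**kwargs))
```

Because an explicit flag overrides the matching environment variable, leaving a flag out is how this sketch defers to whatever the environment already configures.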
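Configuration#
The coordinator is configured through LMCACHE_MP_COORDINATOR_* environment variables:
| Environment variable | Default | Description |
|---|---|---|
| | | Host the HTTP server binds to. |
| | | Port the HTTP server binds to. |
| | | Seconds without a heartbeat after which a server is removed from the fleet. |
| | | Seconds between health-check sweeps. |
| LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL | | Seconds between L2 eviction sweeps. |
| | | Fraction of tracked keys evicted per cycle (by count, in the range 0.0 to 1.0). |
| | | Eviction triggers when usage reaches this fraction of the quota (range (0.0, 1.0], excluding 0.0). |
| | | Tokens per chunk in the global CacheBlend directory. Must equal the LMCache chunk size used by the blending servers. |
| | | Positions between CacheBlend match probes. |
| | | Seconds the HTTP server keeps idle connections open before closing them. Must be greater than the MP servers' heartbeat interval (default |
| | | When |
| | | Seconds between registration checks while waiting for the first MP server to register so that startup resync can begin. |
| LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT | | Maximum seconds startup resync waits for an MP server before giving up. The coordinator keeps running with empty trackers until normal usage events populate them. |
| | | |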
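Connecting MP servers#
An MP server (lmcache server) joins the coordinator when you give it the coordinator's address through --coordinator-url. It registers on startup, sends heartbeats while it runs, and deregisters on shutdown, all on the server's own event loop. The feature is opt-in: with no URL set, the server behaves exactly as before. Every flag falls back to a matching LMCACHE_COORDINATOR_* environment variable (convenient with the Kubernetes Downward API); an explicit flag takes precedence over the environment variable.
| Flag (on the MP server) | Environment fallback | Description |
|---|---|---|
| --coordinator-url | | Coordinator base URL, e.g. |
| | | IP at which the coordinator should reach this server (defaults to the server's external IP). |
| | | Seconds between heartbeats (must be |
| --coordinator-l2-event-reporting | | Enables reporting of L2 store/lookup events to the coordinator, for fleet-wide usage tracking and quota-based eviction. |
| | | Seconds between L2 event batch flushes (must be |
The server registers under its stable identity (--instance-id / OTel service.instance.id); if the flag is not passed, the server generates a random UUID v4 at startup and registers under that UUID.
Registration is best-effort: if the coordinator is unreachable, the MP server logs a warning, keeps retrying, and continues serving. Malformed heartbeat interval values are rejected at startup.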
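The best-effort behaviour can be pictured with a small retry loop. This is a minimal sketch, not the server's actual code: `register` is a hypothetical callable that returns the instance id or raises `OSError` when the coordinator is unreachable, and `retry_delay` and `max_attempts` are illustrative parameters that do not correspond to any documented flag.

```python
import time


def register_with_retry(register, retry_delay=5.0, max_attempts=None,
                        sleep=time.sleep, log=print):
    """Call `register()` until it succeeds; never raise to the caller.

    Returns the instance id on success, or None if `max_attempts`
    (hypothetical knob; None means retry forever) is exhausted.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            return register()
        except OSError as exc:
            # Log a warning and keep going: a missing coordinator
            # must not stop the server from serving its cache.
            log(f"warning: coordinator unreachable (attempt {attempt}): {exc}")
            sleep(retry_delay)
    return None
```

In a real server a loop like this would run in the background (for example as a task on the event loop), so request handling continues while registration is retried.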
HTTP endpoints#
The coordinator's HTTP surface (base URL http://localhost:9300) groups into:
Fleet membership and health -- registration and liveness (/instances, /healthz).
Quota, usage, and eviction -- the /quota group: per-tenant byte budgets, usage accounting, and the usage-event ingest that drives fleet-wide eviction.
Cache control -- the /cache group: cache operations dispatched to a named server (currently warm prefetch, with more to come).
Each endpoint is documented below. Success is 200 unless noted, and
{cache_salt} uses the _default sentinel for the empty salt. The wire
types live in lmcache/v1/mp_coordinator/schemas.py.
Fleet membership and health#
MP servers register, heartbeat, and deregister automatically (see
Connecting MP servers); GET /instances and GET /healthz are read-only
operator views.
POST /instances#
Register (or re-register) an MP server. Called automatically by each server on startup.
Request body:
| Field | Type | Description |
|---|---|---|
| ip | string | IP/host of the server's HTTP API; the coordinator dials this address, so it must be non-empty. |
| http_port | int | Port of the server's HTTP API. |
| instance_id | string | Optional. Server identifier; if omitted (or blank) the coordinator generates one and returns it. |
| metadata | object | Optional. Free-form |
| p2p_advertised_url | string | Optional. URL the server advertises for peer-to-peer transfers; empty when it is not in P2P. |
| mq_port | int | Optional (default |
Response (200 OK):
{"instance_id": "server-1", "re_registered": false}
instance_id is the registered id (the generated one when the request omitted
it); re_registered is true when this replaced an existing registration.
HTTP status codes:
200: registered.
422: request body fails field-level validation (e.g. blank ip or out-of-range http_port).
Example:
curl -s -X POST http://localhost:9300/instances \
-H 'Content-Type: application/json' \
-d '{"ip": "10.0.0.5", "http_port": 8080}'
# -> {"instance_id": "mp-3f2c9d...", "re_registered": false}
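The same registration call can be made from Python with only the standard library. The sketch below mirrors the curl example; the function names are hypothetical, and it sends only the fields shown in that example plus an optional `instance_id`.

```python
import json
import urllib.request


def build_register_request(base_url, ip, http_port, instance_id=None):
    """Build the POST /instances request without sending it."""
    body = {"ip": ip, "http_port": http_port}
    if instance_id:
        # Omit the field to let the coordinator generate an id.
        body["instance_id"] = instance_id
    return urllib.request.Request(
        f"{base_url}/instances",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(base_url, ip, http_port, instance_id=None, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    req = build_register_request(base_url, ip, http_port, instance_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A 422 response surfaces as `urllib.error.HTTPError` from `urlopen`, so callers that want to inspect validation failures should catch that exception.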
PUT /instances/{instance_id}/heartbeat#
Record a liveness heartbeat. Called automatically while the server runs.
Path parameters: instance_id — the instance recording the heartbeat.
Response (200 OK):
{"instance_id": "server-1"}
HTTP status codes:
200: heartbeat recorded.
404: unknown instance; the caller should re-register via POST /instances.
Example:
curl -s -X PUT http://localhost:9300/instances/server-1/heartbeat
# -> {"instance_id": "server-1"}
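The 404 contract suggests a simple client-side rule: treat an unknown-instance reply as a signal to register again. A minimal sketch of one heartbeat tick follows; `send_heartbeat` and `re_register` are hypothetical callables standing in for the PUT and POST requests.

```python
def heartbeat_once(send_heartbeat, re_register):
    """Run one heartbeat tick and react to the coordinator's status code.

    `send_heartbeat()` returns the HTTP status of
    PUT /instances/{instance_id}/heartbeat; `re_register()` performs
    POST /instances. Both are injected so the logic stays testable.
    """
    status = send_heartbeat()
    if status == 200:
        return "ok"
    if status == 404:
        # The coordinator no longer knows this instance (for example
        # after a coordinator restart or a timeout), so register again.
        re_register()
        return "re-registered"
    return "error"
```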
DELETE /instances/{instance_id}#
Deregister an MP server. Called automatically on shutdown.
Path parameters: instance_id — the server to deregister.
Response: 204 No Content with an empty body, returned whether or not the
instance was registered (idempotent).
HTTP status codes:
204: deregistered (also returned for an unknown instance).
Example:
curl -s -X DELETE http://localhost:9300/instances/server-1 -o /dev/null -w '%{http_code}\n'
# -> 204
GET /instances#
List every registered MP server.
Response (200 OK):
{
"instances": [
{
"instance_id": "server-1",
"ip": "10.0.0.5",
"http_port": 8080,
"registration_time": 1719000000.0,
"metadata": {},
"p2p_advertised_url": "",
"mq_port": 0
}
]
}
Each entry reports the server's instance_id, the ip / http_port the
coordinator reaches it at, the wall-clock registration_time (epoch seconds),
any metadata supplied at registration, and the p2p_advertised_url /
mq_port used for peer-to-peer transfers (empty / 0 when P2P is disabled).
HTTP status codes:
200: fleet listed (an empty fleet returns {"instances": []}).
Example:
curl -s http://localhost:9300/instances
GET /healthz#
Coordinator liveness probe (for Kubernetes).
Response (200 OK):
{"status": "healthy"}
HTTP status codes:
200: the coordinator is up.
Example:
curl -s http://localhost:9300/healthz
# -> {"status": "healthy"}
Quota, usage, and eviction#
The /quota group owns per-cache_salt byte budgets, the live usage
accounting behind them, and the usage-event stream that drives fleet-wide
eviction. (The MP server exposes a node-local /quota with the same shape;
this is its fleet-wide counterpart.) Salts without a quota default to a 0-byte
limit (allowlist semantics); use _default as the path parameter to target
the empty-string salt.
When MP servers enable --coordinator-l2-event-reporting, they stream L2
store, lookup, and delete events to the coordinator, which aggregates
per-cache_salt usage, enforces quotas, and selects LRU keys to evict. Each
batch carries the server's instance_id and a monotonically increasing
sequence number (seq) scoped to that instance, enabling future gap detection.
Active eviction loop. Every
LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL seconds, the
coordinator inspects per-salt usage against the registered quotas and,
for any salt over the trigger watermark, picks LRU victims and
dispatches a single DELETE /cache/objects to a uniformly random registered MP
server. Because all MP servers share the same backing L2 (e.g. one S3
bucket), one dispatch evicts the keys for the whole fleet. The MP
server's L2 adapter fires on_l2_keys_deleted listeners after the
delete completes; those listeners ship delete events back through
POST /quota/events, which is what updates the coordinator's LRU +
per-salt totals. Dispatch failures or no-instances-registered fall
through to the next cycle — at-least-once semantics, safe because the
S3 delete is idempotent.
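The per-salt decision in that loop can be sketched as a pure function: compare usage with the watermark, then take a fraction of the least recently used keys. This illustrates the described policy and is not the coordinator's implementation; the default values for `trigger_watermark` and `eviction_ratio`, and the choice to evict at least one key once triggered, are assumptions.

```python
from collections import OrderedDict


def select_victims(lru, usage_bytes, quota_bytes,
                   trigger_watermark=0.9, eviction_ratio=0.1):
    """Pick LRU keys to evict for one salt.

    `lru` maps key -> size in bytes, ordered least recently used first.
    Returns an empty list while usage is below the trigger watermark.
    """
    if usage_bytes < trigger_watermark * quota_bytes:
        return []
    if not lru:
        return []
    # Evict a fraction of tracked keys by count (assumed: at least one).
    count = max(1, int(len(lru) * eviction_ratio))
    return list(lru)[:count]
```

With a quota of 0 bytes (the default for a salt that has no quota), any tracked usage is over the watermark, which matches the allowlist semantics described above.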
Startup resync. On boot, the coordinator waits up to
LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT seconds for the first MP
server to register, then paginates its
GET /cache/objects and seeds the in-memory usage + eviction trackers
with whatever is already resident in L2 — so a fresh coordinator
does not start from zero usage. Set
LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC=False to skip this
phase. Best-effort: resync failures are logged and the manager gives
up; the ongoing usage-event stream from MP servers eventually corrects
any initial blind spots.
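The wait phase of startup resync amounts to polling for a first registration until a deadline passes. A minimal sketch under stated assumptions: `list_instances` is a hypothetical callable returning the currently registered instance ids, and the default `poll_interval` and `max_wait` values are illustrative rather than the coordinator's real defaults.

```python
import time


def wait_for_first_instance(list_instances, poll_interval=1.0, max_wait=30.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Return the first registered instance id, or None after `max_wait`."""
    deadline = clock() + max_wait
    while True:
        instances = list_instances()
        if instances:
            return instances[0]
        if clock() >= deadline:
            # Give up: the caller keeps running with empty trackers and
            # relies on the usage-event stream to fill them in later.
            return None
        sleep(poll_interval)
```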
PUT /quota/{cache_salt}#
Create or update a tenant's byte budget.
Path parameters: cache_salt — tenant identifier (_default for the
empty salt).
Request body:
| Field | Type | Description |
|---|---|---|
| limit_gb | float | Byte budget in GiB; must be |
| tier | string | Optional (default |
Response (200 OK):
{"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}
HTTP status codes:
200: quota applied.
400: invalid limit (negative or non-finite).
422: request body fails field-level validation.
Example:
curl -s -X PUT http://localhost:9300/quota/user-a \
-H 'Content-Type: application/json' \
-d '{"limit_gb": 10.0}'
# -> {"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}
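A client can reject limits that the endpoint would answer with 400 before sending anything. The sketch below builds the PUT request with the standard library; the function name is hypothetical, and mapping an empty salt to `_default` follows the path-parameter rule stated above.

```python
import json
import math
import urllib.parse
import urllib.request


def build_quota_request(base_url, cache_salt, limit_gb):
    """Build PUT /quota/{cache_salt}; raise ValueError on an invalid limit."""
    if not math.isfinite(limit_gb) or limit_gb < 0:
        raise ValueError(f"limit_gb must be finite and non-negative, got {limit_gb!r}")
    # The empty salt is addressed through the _default sentinel.
    salt = cache_salt or "_default"
    path = urllib.parse.quote(salt, safe="")
    return urllib.request.Request(
        f"{base_url}/quota/{path}",
        data=json.dumps({"limit_gb": limit_gb}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Pass the returned request to `urllib.request.urlopen` to apply the quota.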
DELETE /quota/{cache_salt}#
Remove a salt's quota entry. Any bytes still cached under it become over-budget
on the next eviction cycle (effective limit drops to 0).
Path parameters: cache_salt — tenant identifier (_default for the
empty salt).
Query parameters: tier — optional (default l2); cache tier the quota
applies to.
Response (200 OK):
{"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}
When no quota was registered for the salt, status is "not_found" (still
200 OK).
HTTP status codes:
200: removed, or not_found if no quota existed.
Example:
curl -s -X DELETE http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}
GET /quota/{cache_salt}#
Read the quota and live usage for a single salt.
Path parameters: cache_salt — tenant identifier (_default for the
empty salt).
Query parameters: tier — optional (default l2).
Response (200 OK):
{"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
quota_limit_gb is the configured limit in GiB (0.0 when no quota is set),
quota_exists whether an explicit quota is registered, and usage_gb the
current aggregate usage. This endpoint never returns 404 for an unknown salt.
HTTP status codes:
200: quota and usage reported.
Example:
curl -s http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
GET /quota#
List total usage and a per-salt breakdown.
Query parameters: tier — optional (default l2).
Response (200 OK):
{
"total_gb": 0.005,
"by_cache_salt": [
{"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
]
}
total_gb is aggregate usage across all salts in GiB; each by_cache_salt
entry has the same fields as the GET /quota/{cache_salt} response.
HTTP status codes:
200: usage reported.
Example:
curl -s http://localhost:9300/quota
# -> {"total_gb": 0.005, "by_cache_salt": [...]}
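An operator script can use this response to spot salts that are close to, or past, their budget. The helper below is a hypothetical sketch that works on the decoded JSON shown above; the 0.9 threshold is an arbitrary example value, not a documented default.

```python
def salts_near_limit(quota_report, threshold=0.9):
    """Return the salts whose usage has reached `threshold` of their limit.

    `quota_report` is the decoded GET /quota response. A salt with no
    quota has a limit of 0.0, so any non-zero usage flags it.
    """
    flagged = []
    for entry in quota_report["by_cache_salt"]:
        usage = entry["usage_gb"]
        limit = entry["quota_limit_gb"]
        if usage > 0 and usage >= threshold * limit:
            flagged.append(entry["cache_salt"])
    return flagged
```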
POST /quota/events#
Ingest a batch of usage events. Sent automatically by reporting MP servers; not usually called by hand.
Request body:
| Field | Type | Description |
|---|---|---|
| instance_id | string | The MP server that produced this batch. |
| seq | int | Monotonic per-instance sequence number ( |
| | string | Optional (default |
| events | list[object] | The events to record. Each is |
Response (200 OK):
{"recorded": 3}
recorded is the number of events processed.
HTTP status codes:
200: events processed.
422: request body fails field-level validation.
Example:
curl -s -X POST http://localhost:9300/quota/events \
-H 'Content-Type: application/json' \
-d '{
"instance_id": "server-1",
"seq": 1,
"events": [
{"type": "store", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 1024},
{"type": "lookup", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0},
{"type": "delete", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0}
]
}'
# -> {"recorded": 3}
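On the sending side, a batch is a list of buffered events plus a per-instance sequence number. The class below is a minimal sketch of that idea and not the MP server's implementation: `EventBatcher` is a hypothetical name, the event shape copies the curl example, and starting `seq` at 1 is an assumption.

```python
import itertools


class EventBatcher:
    """Buffer usage events and emit POST /quota/events bodies."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self._seq = itertools.count(1)  # assumed starting value
        self._pending = []

    def record(self, event_type, key, nbytes=0):
        """Queue one event ("store", "lookup", or "delete")."""
        self._pending.append({"type": event_type, "key": key, "bytes": nbytes})

    def flush(self):
        """Return the next batch body, or None when nothing is pending."""
        if not self._pending:
            return None
        batch = {
            "instance_id": self.instance_id,
            "seq": next(self._seq),  # consumed only for non-empty batches
            "events": self._pending,
        }
        self._pending = []
        return batch
```

Skipping empty flushes keeps the sequence free of holes that the sender itself would create, which is what makes a gap on the receiving side meaningful.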
Cache control#
The /cache group dispatches cache operations to a named MP server. Today it
covers warm prefetch; further cache-control operations will be documented as
endpoints here as they land.
Warm prefetch (pre-loading L1 from L2). Pre-warm one MP server's L1 with the KV for a known prompt before the requests arrive, so the first request hits L1 instead of paying the L2 fetch inline -- useful when you know a workload is about to be routed to a node (a traffic shift, a hot shared system prompt).
You describe the content by token ids -- the unit the cache speaks -- never
by internal cache keys, which you cannot construct (a key is a content hash
plus a per-rank layout bitmap). The coordinator forwards the request to the
named server, which hashes the tokens, expands them across the node's ranks,
loads the chunks from L2 into L1, and retains them so a later lookup hits.
The submit returns a request_id; poll the status endpoint until
completed. The warm acquires no lock -- the poll simply reports progress and
clears the server-side job once the load finishes.
POST /cache/prefetches#
Submit a warm prefetch of a token sequence on one named server.
Request body:
| Field | Type | Description |
|---|---|---|
| instance_id | string | Target MP server; must be registered. |
| model_name | string | Model whose layout sizes the target's L1 buffers. |
| world_size | int | World size ( |
| token_ids | list[int] | Prompt tokens whose complete |
| cache_salt | string | Optional (default |
Response (200 OK):
{"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}
When the sequence is shorter than one chunk, nothing is submitted and
request_id is empty:
{"instance_id": "server-1", "request_id": "", "chunks": 0, "status": "noop"}
request_id is the id to poll; chunks is the number of whole chunks
submitted to warm.
HTTP status codes:
200: submitted (or a noop as above).
404: unknown instance_id (not registered).
502: the target server was unreachable or rejected the submit.
422: request body fails field-level validation.
Note
Single-node scope: one instance_id warms only that node's shards. For
a model sharded across multiple nodes, submit one request per node's instance.
Example:
curl -s -X POST http://localhost:9300/cache/prefetches \
-H 'Content-Type: application/json' \
-d '{
"instance_id": "server-1",
"model_name": "Qwen/Qwen3-8B",
"world_size": 1,
"token_ids": [101, 102, 103, "..."],
"cache_salt": "user-a"
}'
# -> {"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}
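Because a prefetch warms a single node, warming a model spread over several nodes means sending one body per instance. The sketch below builds those bodies and skips sequences shorter than one chunk, which the endpoint would answer with a noop. The function name is hypothetical, and `chunk_size=256` is an assumed value: use the chunk size your servers are actually configured with.

```python
def build_prefetch_bodies(instance_ids, model_name, world_size, token_ids,
                          cache_salt=None, chunk_size=256):
    """Return one POST /cache/prefetches body per target instance."""
    if len(token_ids) < chunk_size:
        # Shorter than one whole chunk: nothing would be submitted.
        return []
    bodies = []
    for instance_id in instance_ids:
        body = {
            "instance_id": instance_id,
            "model_name": model_name,
            "world_size": world_size,  # passed through unchanged
            "token_ids": list(token_ids),
        }
        if cache_salt is not None:
            body["cache_salt"] = cache_salt
        bodies.append(body)
    return bodies
```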
GET /cache/prefetches/{instance_id}/{request_id}#
Poll a submitted warm prefetch; the response relays the owning server's status verbatim with its code.
Path parameters:
| Field | Type | Description |
|---|---|---|
| instance_id | string | The server the prefetch was submitted to. |
| request_id | string | The id returned by |
Response (200 OK) while the load runs:
{"status": "pending"}
…and once complete:
{"status": "completed", "found_keys": 12, "total_keys": 12}
found_keys is the number of requested chunks that were resident, out of total_keys requested.
HTTP status codes:
200: status reported (pending or completed).
404: unknown instance_id, or unknown request_id relayed from the server.
502: the target server was unreachable.
Example:
curl -s http://localhost:9300/cache/prefetches/server-1/abc123
# -> {"status": "completed", "found_keys": 12, "total_keys": 12}
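Polling until completion can be wrapped in a small loop. This is a minimal sketch: `fetch_status` is a hypothetical callable that performs the GET and returns the decoded JSON, and `poll_interval` and `max_polls` are illustrative values.

```python
import time


def wait_for_prefetch(fetch_status, poll_interval=0.5, max_polls=20,
                      sleep=time.sleep):
    """Poll until the prefetch reports "completed" and return that status.

    Raises TimeoutError if the job is still pending after `max_polls`.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        sleep(poll_interval)
    raise TimeoutError(f"prefetch still pending after {max_polls} polls")
```

Since the poll that observes completion also clears the server-side job, keep the returned status instead of polling again: a later poll for the same request_id may come back as 404.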