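Multi-server coordination#

When you run multiple LMCache multiprocess (MP) servers, the MP coordinator is a standalone service that each server registers with, giving you a single global view of every running server. Each MP server caches independently; the coordinator ties them together into one centrally managed fleet.

Running the coordinator #

The coordinator is a FastAPI service. Start it with: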

lmcache coordinator

Expected log output:

LMCache INFO: MP coordinator listening on http://0.0.0.0:9300

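The CLI accepts --host, --port, --instance-timeout, --health-check-interval, --eviction-check-interval, --eviction-ratio, --trigger-watermark, --chunk-size, --hash-algorithm, --blend-probe-stride, and --timeout-keep-alive; any flag overrides the matching environment variable below. See lmcache coordinator for details. Equivalently, the coordinator can still be launched as a module with python3 -m lmcache.v1.mp_coordinator.

Configuration #

The coordinator is configured through LMCACHE_MP_COORDINATOR_* environment variables:

Environment variable	Default	Description
`LMCACHE_MP_COORDINATOR_HOST`	`0.0.0.0`	Host the HTTP server binds to.
`LMCACHE_MP_COORDINATOR_PORT`	`9300`	Port the HTTP server binds to.
`LMCACHE_MP_COORDINATOR_INSTANCE_TIMEOUT`	`30`	Seconds without a heartbeat after which a server is removed from the fleet.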
`LMCACHE_MP_COORDINATOR_HEALTH_CHECK_INTERVAL`	`10`	Seconds between health-check sweeps that expire stale MP-server registrations. `0` disables the stale-instance eviction loop; it does not affect the `/quota` L2 eviction loop (see `LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL` below).
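`LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL`	`5`	Seconds between L2 eviction sweeps. `0` disables the loop.
`LMCACHE_MP_COORDINATOR_EVICTION_RATIO`	`0.2`	Fraction of tracked keys evicted per cycle (by count, in the range 0.0 to 1.0).
`LMCACHE_MP_COORDINATOR_TRIGGER_WATERMARK`	`1.0`	Eviction triggers once usage reaches this fraction of the quota (range (0.0, 1.0], excluding 0.0).
`LMCACHE_MP_COORDINATOR_CHUNK_SIZE`	`256`	Tokens per KV chunk: the CacheBlend match unit and the unit used to resolve pinned `token_ids` to keys. Must equal the MP servers' `--chunk-size`.
`LMCACHE_MP_COORDINATOR_HASH_ALGORITHM`	`blake3`	Token hash algorithm used for pin key resolution. Must equal the MP servers' `--hash-algorithm`. `blake3` is self-contained; other algorithms require vLLM to be importable in the coordinator process.
`LMCACHE_MP_COORDINATOR_BLEND_PROBE_STRIDE`	`1`	Positions between CacheBlend match probes. `1` probes at every offset for full recall.
`LMCACHE_MP_COORDINATOR_TIMEOUT_KEEP_ALIVE`	`10`	Seconds the HTTP server keeps an idle connection open before closing it. Must be greater than the MP servers' heartbeat interval (default `5`); otherwise a heartbeat request can hit a closed connection and fail with `Server disconnected without sending a response`.
`LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC`	`True`	When `True`, the coordinator performs a one-time L2 resync at startup, paging through an MP server's `GET /cache/objects` and populating the usage and eviction trackers from the existing L2 contents. Disable it to start with empty trackers (useful for tests, or for deployments that start the coordinator before any MP server).
`LMCACHE_MP_COORDINATOR_RESYNC_POLL_INTERVAL`	`1`	Seconds between registration checks while waiting for the first MP server to register so that the startup resync can begin.
`LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT`	`60`	Maximum seconds the startup resync waits for an MP server before giving up. The coordinator keeps running with empty trackers until normal usage events populate them.
`LMCACHE_MP_COORDINATOR_RESYNC_PAGE_SIZE`	`1000`	`page_size` forwarded to the MP server's `GET /cache/objects` during resync. Larger values reduce the number of round trips; the server clamps the value to its own upper bound.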

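Connecting MP servers #

An MP server (lmcache server) joins the coordinator when you give it the coordinator's address with --coordinator-url. It registers on startup, heartbeats while running, and deregisters on shutdown, all from the server's own event loop. This is opt-in: with no URL set, the server behaves exactly as before. Every flag falls back to a matching LMCACHE_COORDINATOR_* environment variable (convenient with the Kubernetes Downward API); an explicit flag takes precedence over the environment variable.

Flag (on the MP server)	Env fallback	Description
`--coordinator-url`	`LMCACHE_COORDINATOR_URL`	Coordinator base URL, e.g. `http://coordinator:9300`. Setting it enables registration.
`--coordinator-advertise-ip`	`LMCACHE_COORDINATOR_ADVERTISE_IP`	IP at which the coordinator should reach this server (defaults to the server's external IP).
`--coordinator-heartbeat-interval`	`LMCACHE_COORDINATOR_HEARTBEAT_INTERVAL`	Seconds between heartbeats (must be `> 0`, default `5`). Keep it well below the coordinator's `INSTANCE_TIMEOUT`.
`--coordinator-l2-event-reporting`	`LMCACHE_COORDINATOR_L2_EVENT_REPORTING`	Enables reporting of L2 store/lookup events to the coordinator, for fleet-wide usage tracking and quota-based eviction.
`--coordinator-l2-event-flush-interval`	`LMCACHE_COORDINATOR_L2_EVENT_FLUSH_INTERVAL`	Seconds between L2 event batch flushes (must be `> 0`, default `1`).

The server registers under its stable identity (--instance-id / OTel service.instance.id); if the flag is not passed, the server generates a random UUID v4 at startup and registers under that UUID.

Registration is best-effort: if the coordinator is unreachable, the MP server logs a warning, keeps retrying, and continues serving. A malformed heartbeat interval value is rejected at startup.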
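
The timing constraints stated in the two tables (keep-alive greater than the heartbeat interval, heartbeat interval below the instance timeout) can be checked before a rollout. The following is a minimal sketch, not part of LMCache: `check_intervals` is a hypothetical helper, and because the text says only "well below" without giving a number, it merely checks that the heartbeat interval is strictly below the instance timeout.

```python
def check_intervals(heartbeat_interval=5.0, instance_timeout=30.0,
                    timeout_keep_alive=10.0):
    """Return a list of problems; an empty list means the values look consistent.

    Defaults mirror the documented defaults. Hypothetical helper, not an LMCache API.
    """
    problems = []
    if heartbeat_interval <= 0:
        problems.append("heartbeat interval must be > 0")
    if timeout_keep_alive <= heartbeat_interval:
        problems.append("keep-alive must exceed the heartbeat interval")
    if heartbeat_interval >= instance_timeout:
        problems.append("heartbeat interval must stay below the instance timeout")
    return problems


print(check_intervals())                        # documented defaults
print(check_intervals(heartbeat_interval=10))   # keep-alive no longer exceeds it
```

With the documented defaults the list is empty; raising the heartbeat interval to the keep-alive value is flagged.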

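HTTP endpoints #

The coordinator's HTTP interface (base URL http://localhost:9300) is divided into:

Fleet membership and health -- registration and liveness (/instances, /healthz).
Quota, usage, and eviction -- the /quota group: per-tenant byte budgets, usage accounting, and the usage event ingestion that drives fleet-wide eviction.
Cache control -- the /cache group: cache operations dispatched to named servers (warm prefetch, pin/unpin, and delete, with more to come).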
CacheBlend fingerprint directory -- the /blend group: the fleet-wide fingerprint index blend-enabled MP servers publish to on STORE and query on LOOKUP. Server-to-coordinator only; not usually called by hand.

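Each endpoint is documented below. The success status is 200 unless stated otherwise, and {cache_salt} uses the _default sentinel for the empty salt. Data types live in lmcache/v1/mp_coordinator/schemas.py.

Fleet membership and health #

MP servers register, heartbeat, and deregister automatically (see Connecting MP servers); GET /instances and GET /healthz are read-only operator views.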

`POST /instances`#

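Registers (or re-registers) an MP server. Called automatically by each server on startup.

Request body:

Field	Type	Description
`ip`	string	IP/host address of the server's HTTP API; the coordinator dials this address, so it must be non-empty.
`http_port`	integer	Port of the server's HTTP API.
`instance_id`	string	Optional. Server identifier; if omitted (or blank), the coordinator generates one and returns it.
`metadata`	object	Optional. Free-form `string -> string` registration hints.
`p2p_advertised_url`	string	Optional. URL the server advertises for peer-to-peer transfers; empty when not in P2P mode.
`mq_port`	integer	Optional (default `0`). ZMQ message-queue port to which P2P peers send lookup/unlock RPCs; `0` when P2P is disabled.

Response (200 OK):

{"instance_id": "server-1", "re_registered": false}

instance_id is the registered ID (the generated one when the request omitted it); re_registered is true when the call replaced an existing registration.

HTTP status codes:

200: Registered.
422: The request body failed field-level validation (for example, a blank ip or an out-of-range http_port).

Example: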

curl -s -X POST http://localhost:9300/instances \
    -H 'Content-Type: application/json' \
    -d '{"ip": "10.0.0.5", "http_port": 8080}'
# -> {"instance_id": "mp-3f2c9d...", "re_registered": false}
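
The same registration call can be built from Python with only the standard library. This is a sketch under assumptions: `build_register_request` is a hypothetical helper name, the base URL is the documented default, and the request is only constructed here, since sending it needs a running coordinator.

```python
import json
import urllib.request


def build_register_request(base_url, ip, http_port, instance_id=None):
    """Build (but do not send) a POST /instances request."""
    body = {"ip": ip, "http_port": http_port}
    if instance_id:
        # Optional: when omitted, the coordinator generates an id and returns it.
        body["instance_id"] = instance_id
    return urllib.request.Request(
        f"{base_url}/instances",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_register_request("http://localhost:9300", "10.0.0.5", 8080)
print(req.get_method(), req.full_url)

# To actually send it against a running coordinator:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(json.load(resp))
```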

`PUT /instances/{instance_id}/heartbeat`#

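Records a liveness heartbeat. Called automatically while the server is running.

Path parameter: instance_id (the instance whose heartbeat is recorded).

Response (200 OK):

{"instance_id": "server-1"}

HTTP status codes:

200: Heartbeat recorded.
404: Unknown instance; the caller should re-register via POST /instances.

Example: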

curl -s -X PUT http://localhost:9300/instances/server-1/heartbeat
# -> {"instance_id": "server-1"}
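
The 404 rule (an unknown instance should re-register) can be sketched as a small piece of client logic. This is an illustration only, not the MP server's actual implementation: the two callables are hypothetical stand-ins for the HTTP calls, and the in-memory `known` set plays the role of the coordinator.

```python
def heartbeat_once(send_heartbeat, register, instance_id):
    """Send one heartbeat; on 404, re-register and return the id to use next.

    send_heartbeat(instance_id) -> int HTTP status   (hypothetical callable)
    register() -> str instance id                    (hypothetical callable)
    """
    status = send_heartbeat(instance_id)
    if status == 404:
        # The coordinator no longer knows this instance (for example after a
        # coordinator restart or an instance timeout), so register again.
        return register()
    return instance_id


# In-memory stand-in for the coordinator, for illustration only.
known = set()

def fake_register():
    known.add("server-1")
    return "server-1"

def fake_heartbeat(instance_id):
    return 200 if instance_id in known else 404

# First heartbeat hits 404 and triggers a registration.
current_id = heartbeat_once(fake_heartbeat, fake_register, "server-1")
print(current_id, fake_heartbeat(current_id))
```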

`DELETE /instances/{instance_id}`#

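Deregisters an MP server. Called automatically on shutdown.

Path parameter: instance_id (the server to deregister).

Response: 204 No Content with an empty body, whether or not the instance was registered (idempotent).

HTTP status codes:

204: Deregistered (also returned for an unknown instance).

Example: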

curl -s -X DELETE http://localhost:9300/instances/server-1 -o /dev/null -w '%{http_code}\n'
# -> 204

`GET /instances`#

Lists every registered MP server.

Response (200 OK):

{
  "instances": [
    {
      "instance_id": "server-1",
      "ip": "10.0.0.5",
      "http_port": 8080,
      "registration_time": 1719000000.0,
      "metadata": {},
      "p2p_advertised_url": "",
      "mq_port": 0
    }
  ]
}

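Each entry reports the server's instance_id, the ip / http_port at which the coordinator reaches it, the wall-clock registration_time (epoch seconds), any metadata supplied at registration, and the p2p_advertised_url / mq_port used for peer-to-peer transfers (empty / 0 when P2P is disabled).

HTTP status codes:

200: Fleet listed (an empty fleet returns {"instances": []}).

Example: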

curl -s http://localhost:9300/instances

`GET /healthz`#

Coordinator liveness probe (for Kubernetes).

Response (200 OK):

{"status": "healthy"}

HTTP status codes:

200: The coordinator is up.

Example:

curl -s http://localhost:9300/healthz
# -> {"status": "healthy"}

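Quota, usage, and eviction #

The /quota group owns the per-cache_salt byte budgets, the live usage accounting behind them, and the usage event stream that drives fleet-wide eviction. (MP servers expose a node-local /quota of the same shape; this is its fleet-wide counterpart.) Use _default as the path parameter to address the empty-string salt.

Warning

Do not use the MP server's node-local /quota API together with the coordinator's. The two are independent, unsynchronized quota registries that enforce eviction on the same shared L2: the server-side enforcer (active whenever a server runs the per-salt eviction policy) uses strict allowlist semantics, so any salt missing from its own table is evicted entirely, and it never sees quotas registered on the coordinator, or vice versa. Mixing the two produces competing eviction decisions: a server can delete data that the coordinator considers within quota (or still exempt before the default limit takes effect). Pick one owner per deployment: in a coordinator-managed deployment, register quotas only through the coordinator's /quota API and leave the servers' node-local quota tables untouched.

Salts without an explicit quota are governed by the registry's default limit (PUT /quota/config). At startup the default is unset and unquota'd salts are exempt from eviction. Quotas are stored in memory, so a freshly (re)started coordinator has an empty quota table until an external quota controller resyncs it, and the exempt default prevents mass eviction of unknown tenants during that window. After re-registering every per-salt quota, the controller sets the default to 0, which is the signal that arms strict allowlist enforcement (all bytes under unquota'd salts become evictable on the next cycle):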

# 1. re-register every tenant quota
curl -s -X PUT http://localhost:9300/quota/user-a \
    -H 'Content-Type: application/json' -d '{"limit_gb": 10.0}'
# ... one PUT per tenant ...

# 2. arm eviction of everything else
curl -s -X PUT http://localhost:9300/quota/config \
    -H 'Content-Type: application/json' -d '{"default_limit_gb": 0}'
# -> {"default_limit_gb": 0.0}
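
The ordering matters: every tenant quota must be registered before the default is set to 0, otherwise known tenants would briefly count as unquota'd. A quota controller could make that ordering explicit by building the request list first. This is a sketch only; `resync_plan` is a hypothetical helper, and URL-quoting the salt in the path is an assumption about how arbitrary salts should be encoded.

```python
from urllib.parse import quote


def resync_plan(tenant_limits_gb):
    """Return the ordered (method, path, json_body) calls for a controller resync."""
    plan = []
    for salt, limit_gb in sorted(tenant_limits_gb.items()):
        # The empty salt is addressed through the _default sentinel.
        path_salt = quote(salt, safe="") if salt else "_default"
        plan.append(("PUT", f"/quota/{path_salt}", {"limit_gb": float(limit_gb)}))
    # Arm allowlist enforcement only after every tenant quota is in place.
    plan.append(("PUT", "/quota/config", {"default_limit_gb": 0}))
    return plan


plan = resync_plan({"user-a": 10, "": 1})
for step in plan:
    print(step)
```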

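When MP servers enable --coordinator-l2-event-reporting, they stream L2 store, lookup, and delete events to the coordinator, which aggregates usage per cache_salt, enforces quotas, and selects LRU keys for eviction. Each batch carries the server's instance_id and a monotonically increasing sequence number (seq) scoped to that instance, which enables future gap detection.

Active eviction loop. Every LMCACHE_MP_COORDINATOR_EVICTION_CHECK_INTERVAL seconds, the coordinator checks each salt's usage against its registered quota; for any salt over the trigger watermark it selects LRU victims and sends a single DELETE /cache/objects to one uniformly random registered MP server. Because all MP servers share the same backing L2 (for example, one S3 bucket), a single dispatch evicts the keys for the whole fleet. The MP server's L2 adapter fires its on_l2_keys_deleted listeners once the delete completes; those listeners send delete events via POST /quota/events, which is how the coordinator's LRU and per-salt totals get updated. A failed dispatch, or having no registered instance, rolls over to the next cycle: at-least-once semantics, which is safe because S3 deletes are idempotent.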
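
The per-salt decision in one eviction cycle can be illustrated with a toy model. This is a sketch under stated assumptions, not the coordinator's code: `pick_victims` is a hypothetical function, the ratio is applied to one salt's tracked keys and rounded up, and the LRU order is modelled with an `OrderedDict` whose oldest entry comes first.

```python
import math
from collections import OrderedDict

GIB = 1024 ** 3


def pick_victims(lru, usage_bytes, limit_gb, watermark=1.0, ratio=0.2):
    """Return the keys to evict for one salt in one cycle.

    lru: OrderedDict of key -> size in bytes, least recently used first.
    limit_gb: None models an unquota'd, exempt salt.
    """
    if limit_gb is None or not lru:
        return []
    if usage_bytes < watermark * limit_gb * GIB:
        return []  # usage has not reached the trigger watermark
    count = max(1, math.ceil(ratio * len(lru)))  # rounding up is an assumption
    return list(lru)[:count]


lru = OrderedDict((f"k{i}", GIB // 5) for i in range(10))
usage = sum(lru.values())                  # just under 2 GiB

print(pick_victims(lru, usage, limit_gb=4))    # under quota: nothing to evict
print(pick_victims(lru, usage, limit_gb=1))    # over quota: the two oldest keys

lru.move_to_end("k0")                      # a lookup refreshes k0's recency
print(pick_victims(lru, usage, limit_gb=1))
```

In the real system the victims are then sent in a single DELETE /cache/objects, and the tracker is only updated once the resulting delete events arrive.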

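Startup resync. On startup the coordinator waits up to LMCACHE_MP_COORDINATOR_RESYNC_MAX_WAIT seconds for the first MP server to register, then pages through its GET /cache/objects and populates the in-memory usage and eviction trackers with what already exists in L2, so a new coordinator does not start from zero usage. Set LMCACHE_MP_COORDINATOR_ENABLE_STARTUP_RESYNC=False to skip this phase. The resync is best-effort: a failure is logged and the manager gives up; the ongoing usage event stream from the MP servers eventually corrects any initial blind spots.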

`PUT /quota/config` / `GET /quota/config`#

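Sets / reads the default limit for salts without an explicit quota entry.

Request body (PUT):

Field	Type	Description
`default_limit_gb`	float or null	`null` (the boot default) exempts unquota'd salts from eviction; `0` arms strict allowlist enforcement (all unquota'd bytes become evictable on the next cycle); a positive value gives every unquota'd salt that budget.
`tier`	string	Optional (default `l2`). Only `l2` is currently supported.

Response (200 OK):

{"default_limit_gb": 0.0}

Example: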

curl -s http://localhost:9300/quota/config
# -> {"default_limit_gb": null}          (boot state: unquota'd exempt)

curl -s -X PUT http://localhost:9300/quota/config \
    -H 'Content-Type: application/json' -d '{"default_limit_gb": 0}'
# -> {"default_limit_gb": 0.0}           (allowlist enforcement armed)

`PUT /quota/{cache_salt}`#

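Creates or updates a tenant's byte budget.

Path parameter: cache_salt (the tenant identifier; _default means the empty salt).

Request body:

Field	Type	Description
`limit_gb`	float	Byte budget in GiB; must be `>= 0` (`0` purges the tenant's data on the next eviction cycle).
`tier`	string	Optional (default `l2`). Cache tier the quota applies to; only `l2` is currently supported.

Response (200 OK):

{"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}

HTTP status codes:

200: Quota applied.
400: Invalid limit (negative or non-finite).
422: The request body failed field-level validation.

Example: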

curl -s -X PUT http://localhost:9300/quota/user-a \
    -H 'Content-Type: application/json' \
    -d '{"limit_gb": 10.0}'
# -> {"cache_salt": "user-a", "limit_gb": 10.0, "status": "ok"}

`DELETE /quota/{cache_salt}`#

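Removes a salt's quota entry. Any bytes still cached become over budget on the next eviction cycle (the effective limit drops to 0).

Path parameter: cache_salt (the tenant identifier; _default means the empty salt).

Query parameter: tier (optional, default l2; the cache tier the quota applies to).

Response (200 OK):

{"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}

When no quota was registered for the salt, status is "not_found" (still 200 OK).

HTTP status codes:

200: Removed, or not_found if no quota existed.

Example: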

curl -s -X DELETE http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "limit_gb": 0.0, "status": "removed"}

`GET /quota/{cache_salt}`#

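Reads a single salt's quota and live usage.

Path parameter: cache_salt (the tenant identifier; _default means the empty salt).

Query parameter: tier (optional, default l2).

Response (200 OK):

{"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}

quota_limit_gb is the configured limit in GiB (0.0 when no quota is set), quota_exists says whether an explicit quota is registered, and usage_gb is the current total usage. This endpoint never returns 404 for an unknown salt.

HTTP status codes:

200: Quota and usage reported.

Example: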

curl -s http://localhost:9300/quota/user-a
# -> {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}

`GET /quota`#

Lists total usage and the per-salt breakdown.

Query parameter: tier (optional, default l2).

Response (200 OK):

{
  "total_gb": 0.005,
  "by_cache_salt": [
    {"cache_salt": "user-a", "quota_limit_gb": 10.0, "quota_exists": true, "usage_gb": 0.001}
  ]
}

total_gb is the total usage across all salts (in GiB); each by_cache_salt entry has the same fields as the GET /quota/{cache_salt} response.

HTTP status codes:

200: Usage reported.

Example:

curl -s http://localhost:9300/quota
# -> {"total_gb": 0.005, "by_cache_salt": [...]}

`POST /quota/events`#

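Ingests a batch of usage events. Sent automatically by reporting MP servers; not usually called by hand.

Request body:

Field	Type	Description
`instance_id`	string	MP server that produced this batch.
`seq`	integer	Per-instance monotonic sequence number (`>= 1`); supports future gap detection for lost batches.
`tier`	string	Optional (default `l2`). Cache tier the events apply to.
`events`	list[object]	Events to record. Each event is `{"type", "key", "bytes"}`: `type` is `"store"`, `"lookup"`, or `"delete"`; `key` is the encoded object key; `bytes` (`>= 0`) is the stored size, counted for `store` and ignored for `lookup` / `delete` (`delete` subtracts the size recorded at the original `store`).

Response (200 OK):

{"recorded": 3}

recorded is the number of events processed.

HTTP status codes:

200: Events processed.
422: The request body failed field-level validation.

Example: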

curl -s -X POST http://localhost:9300/quota/events \
    -H 'Content-Type: application/json' \
    -d '{
        "instance_id": "server-1",
        "seq": 1,
        "events": [
            {"type": "store",  "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 1024},
            {"type": "lookup", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0},
            {"type": "delete", "key": {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}, "bytes": 0}
        ]
    }'
# -> {"recorded": 3}
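
The accounting rule in the events table (count `bytes` on `store`, ignore it on `lookup` and `delete`, and subtract the size recorded at the original `store` on `delete`) can be sketched as a small in-memory tracker. This is an illustration, not the coordinator's implementation: `UsageTracker` and its `gap` flag are hypothetical, and the gap check shown is only one possible use of `seq`.

```python
class UsageTracker:
    """Toy per-salt usage accounting driven by store/lookup/delete events."""

    def __init__(self):
        self.sizes = {}     # key identity -> bytes recorded at store time
        self.by_salt = {}   # cache_salt -> total bytes
        self.last_seq = {}  # instance_id -> last sequence number seen

    def ingest(self, instance_id, seq, events):
        # A batch is "in order" when seq is exactly one past the previous one.
        gap = seq != self.last_seq.get(instance_id, 0) + 1
        self.last_seq[instance_id] = seq
        for ev in events:
            key = ev["key"]
            salt = key["cache_salt"]
            ident = (salt, key["chunk_hash_hex"], key["model_name"], key["kv_rank"])
            if ev["type"] == "store":
                previous = self.sizes.get(ident, 0)
                self.sizes[ident] = ev["bytes"]
                self.by_salt[salt] = self.by_salt.get(salt, 0) - previous + ev["bytes"]
            elif ev["type"] == "delete":
                # The event's own "bytes" is ignored; use the stored size.
                self.by_salt[salt] = self.by_salt.get(salt, 0) - self.sizes.pop(ident, 0)
            # "lookup" would only refresh recency; usage is unchanged.
        return {"recorded": len(events), "gap": gap}


key = {"chunk_hash_hex": "aa", "model_name": "m", "kv_rank": 0, "cache_salt": "user-a"}
tracker = UsageTracker()
first = tracker.ingest("server-1", 1, [
    {"type": "store", "key": key, "bytes": 1024},
    {"type": "lookup", "key": key, "bytes": 0},
])
print(first, tracker.by_salt)
second = tracker.ingest("server-1", 2, [{"type": "delete", "key": key, "bytes": 0}])
print(second, tracker.by_salt)
```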

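Cache control #

The /cache group dispatches cache operations to named MP servers. It covers warm prefetch, pin/unpin, and delete; further cache-control operations will be documented here as endpoints.

Warm prefetch (preload L1 from L2). Warms one MP server's L1 with the KV of a known prompt before requests arrive, so the first request hits L1 instead of paying for an inline L2 fetch. This is useful when you know a workload is about to be routed to a node (traffic shifts, hot shared system prompts).

You describe the content by token ids, the unit the cache works in, rather than by internal cache keys, which you cannot construct (a key is a content hash plus a per-rank layout bitmap). The coordinator forwards the request to the named server, which hashes the tokens, fans them out across the node's ranks, loads the chunks from L2 into L1, and retains them so that later lookups hit. Submission returns a request_id; poll the status endpoint until it reports completed. Warming takes no locks: polling only reports progress, and it clears the server-side job once the load completes.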

`POST /cache/prefetches`#

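Submits a warm prefetch of a token sequence on one named server.

Request body:

Field	Type	Description
`instance_id`	string	Target MP server; must be registered.
`model_name`	string	Model whose layout sizes the target's L1 buffers.
`world_size`	integer	World size (`>= 1`); selects the KV layout and the per-rank fan-out (`1` for a single-GPU, TP=1 deployment).
`token_ids`	list[int]	Prompt tokens whose full `chunk_size` chunks are warmed; must match what was stored (same tokenizer / special tokens). A sub-chunk sequence is a `noop`.
`cache_salt`	string	Optional (default `""`). Per-tenant isolation salt applied to the generated keys.

Response (200 OK):

{"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}

When the sequence is shorter than one chunk, nothing is submitted and request_id is empty:

{"instance_id": "server-1", "request_id": "", "chunks": 0, "status": "noop"}

request_id is the ID to poll with; chunks is the number of full chunks submitted.

HTTP status codes:

200: Submitted (or noop, as above).
404: Unknown instance_id (not registered).
502: The target server was unreachable or rejected the submission.
422: The request body failed field-level validation.

Note

Single-node scope: one instance_id warms only that node's shard. For a model that spans multiple nodes, submit one request per node's instance.

Example: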

curl -s -X POST http://localhost:9300/cache/prefetches \
    -H 'Content-Type: application/json' \
    -d '{
        "instance_id": "server-1",
        "model_name": "Qwen/Qwen3-8B",
        "world_size": 1,
        "token_ids": [101, 102, 103, "..."],
        "cache_salt": "user-a"
    }'
# -> {"instance_id": "server-1", "request_id": "abc123", "chunks": 12, "status": "submitted"}
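
Because only full chunks are warmed, the `chunks` count in the response follows from the token count and the chunk size. The sketch below shows that arithmetic; `plan_prefetch` and the `ignored_tail_tokens` field are hypothetical, and the chunk size of 256 is the documented default, which must match the deployment.

```python
def plan_prefetch(token_ids, chunk_size=256):
    """Predict how many full chunks a prefetch of token_ids would cover."""
    chunks = len(token_ids) // chunk_size
    return {
        "chunks": chunks,
        # A sequence shorter than one chunk submits nothing.
        "status": "submitted" if chunks else "noop",
        # Tokens past the last full chunk are not covered by any chunk.
        "ignored_tail_tokens": len(token_ids) - chunks * chunk_size,
    }


print(plan_prefetch(list(range(3100))))   # 12 full chunks, 28 tokens left over
print(plan_prefetch(list(range(100))))    # shorter than one chunk
```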

`GET /cache/prefetches/{instance_id}/{request_id}`#

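Polls a submitted warm prefetch; the response passes the owning server's status and its code through verbatim.

Path parameters:

Field	Type	Description
`instance_id`	string	Server the prefetch was submitted to.
`request_id`	string	The id returned by `POST /cache/prefetches`.

Response (200 OK) while the load is running:

{"status": "pending"}

...and once it completes:

{"status": "completed", "found_keys": 12, "total_keys": 12}

found_keys of the total_keys requested chunks are resident.

HTTP status codes:

200: Status reported (pending or completed).
404: Unknown instance_id, or a request_id the server does not know.
502: The target server was unreachable.

Example: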

curl -s http://localhost:9300/cache/prefetches/server-1/abc123
# -> {"status": "completed", "found_keys": 12, "total_keys": 12}
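
A client typically wraps this endpoint in a poll loop with a deadline. The following is a minimal sketch: `wait_for_prefetch` is a hypothetical helper, `fetch_status` stands in for the HTTP GET, and the loop stops at the first completed response because, as described earlier, polling clears the server-side job once the load completes.

```python
import time


def wait_for_prefetch(fetch_status, timeout_s=30.0, poll_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the prefetch reports completed or the deadline passes.

    fetch_status() -> dict, e.g. {"status": "pending"}   (hypothetical callable)
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if clock() >= deadline:
            raise TimeoutError("prefetch still pending at the deadline")
        sleep(poll_s)


# Canned responses in place of a real coordinator, for illustration only.
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "completed", "found_keys": 12, "total_keys": 12},
])
result = wait_for_prefetch(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```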

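Pin/unpin (protect cache from eviction). Pins a token sequence's cache so that it is not evicted from L2 until unpinned. The coordinator resolves the token sequence to its object keys locally (no MP-server round trip) and records them in the L2 eviction plan (POST) or releases them (DELETE), excluding pinned keys from quota-based eviction. L2 pins are fleet-wide (per cache_salt), so no target instance is named.

Local resolution requires the coordinator's chunk_size and hash_algorithm (see Configuration) to match the MP servers' --chunk-size / --hash-algorithm; otherwise the resolved keys will not match the stored keys and the pin protects nothing. It also requires the MP servers to be started with --no-separate-object-groups (the coordinator resolves keys in a single object group).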

`POST /cache/pins`#

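Pins a token sequence's keys in the L2 eviction plan.

Request body:

Field	Type	Description
`model_name`	string	Model whose rank fan-out is used when resolving keys.
`world_size`	integer	World size (`>= 1`); selects the per-rank fan-out.
`token_ids`	list[int]	Prompt tokens whose full chunks are pinned; must match what was stored. A sub-chunk sequence pins nothing (`affected` 0).
`cache_salt`	string	Optional (default `""`). Per-tenant isolation salt.

Response (200 OK):

{"requested": 12, "affected": 12, "status": "pinned"}

requested is the number of full chunks resolved; affected is the number of L2 keys pinned (chunks times the per-rank fan-out).

HTTP status codes:

200: Pinned.
400: token_ids exceeds the per-request cap, or cache_salt violates its invariants.
422: The request body failed field-level validation.

Example: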

curl -s -X POST http://localhost:9300/cache/pins \
    -H 'Content-Type: application/json' \
    -d '{
        "model_name": "Qwen/Qwen3-8B",
        "world_size": 1,
        "token_ids": [101, 102, 103, "..."],
        "cache_salt": "user-a"
    }'
# -> {"requested": 12, "affected": 12, "status": "pinned"}

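Note

Requires L2 event reporting. The coordinator can only exclude keys from eviction for salts it is tracking, which requires the MP servers to be started with --coordinator-l2-event-reporting (see Connecting MP servers).

`DELETE /cache/pins`#

Unpins a token sequence's keys from the L2 eviction plan. The request body is the same as for POST /cache/pins. The response mirrors the pin response (affected is the number of keys unpinned), with status "unpinned". Pins are reference-counted: a chunk pinned N times needs N unpins before it can be evicted.

HTTP status codes: same as POST /cache/pins.

Example: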

curl -s -X DELETE http://localhost:9300/cache/pins \
    -H 'Content-Type: application/json' \
    -d '{
        "model_name": "Qwen/Qwen3-8B",
        "world_size": 1,
        "token_ids": [101, 102, 103, "..."],
        "cache_salt": "user-a"
    }'
# -> {"requested": 12, "affected": 12, "status": "unpinned"}
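
The reference-counting rule (N pins need N unpins) can be illustrated with a small table. This is a sketch of the semantics only, not the coordinator's data structure; `PinTable` and its method names are hypothetical.

```python
from collections import Counter


class PinTable:
    """Toy reference-counted pin set keyed by object key."""

    def __init__(self):
        self.refs = Counter()

    def pin(self, keys):
        for key in keys:
            self.refs[key] += 1
        return len(keys)  # number of keys affected

    def unpin(self, keys):
        released = 0
        for key in keys:
            if self.refs[key] > 0:
                self.refs[key] -= 1
                released += 1
                if self.refs[key] == 0:
                    del self.refs[key]  # last reference gone: evictable again
        return released

    def is_pinned(self, key):
        return self.refs[key] > 0


pins = PinTable()
pins.pin(["chunk-0"])
pins.pin(["chunk-0"])            # pinned twice
pins.unpin(["chunk-0"])
print(pins.is_pinned("chunk-0"))  # still protected after one unpin
pins.unpin(["chunk-0"])
print(pins.is_pinned("chunk-0"))  # evictable after the second unpin
```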

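Delete (remove cache by token sequence). Deletes a token sequence's cache on one named server, addressed by token ids. The coordinator resolves the tokens to object keys locally (as for pin) and issues a single key-addressed DELETE /cache/objects to the named server, removing them from the requested tier. The tier field selects the tier: l1 deletes only the named server's L1, l2 deletes only L2, and all deletes both. When the tier includes L2, the coordinator first drops any L2-pinned keys from the delete set unless force is set, so a pinned key survives in every tier the delete touches; force deletes them and removes those pins.

`POST /cache/delete`#

Deletes a token sequence on one named server.

Request body: (model_name, world_size, token_ids, cache_salt) plus tier (l1 / l2 / all) and force (boolean, default false). The example below also passes instance_id to name the target server. When force is true, locked keys are deleted anyway (L1 read/write locks on the node and the coordinator's L2 pin set).

Response (200 OK):

{"instance_id": "server-1", "requested": 12, "affected": 24, "skipped": 0, "status": "deleted"}

requested is the number of full chunks resolved. affected and skipped are totals across the tiers acted on: affected counts the L1 keys the node removed plus the L2 keys the coordinator removed, while skipped counts the L1 keys the node refused plus the L2 keys retained because of an L2 pin (non-force only). A chunk present in both tiers (tier=all) contributes to both counts, so affected can be as high as 2 x requested x world_size. A sub-chunk sequence returns status "noop".

HTTP status codes:

200: Deleted (or noop).
404: No server is registered under instance_id.
502: The target server was unreachable or rejected the delete.

Example: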

curl -s -X POST http://localhost:9300/cache/delete \
    -H 'Content-Type: application/json' \
    -d '{
        "instance_id": "server-1",
        "model_name": "Qwen/Qwen3-8B",
        "world_size": 1,
        "token_ids": [101, 102, 103, "..."],
        "cache_salt": "user-a",
        "tier": "all",
        "force": false
    }'
# -> {"instance_id": "server-1", "requested": 12, "affected": 24, "skipped": 0, "status": "deleted"}
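
The pin filtering and the upper bound on affected can be sketched as follows. Both functions are hypothetical illustrations of the rules described for this endpoint, not the coordinator's code; the bound of 2 x requested x world_size is taken from the text, and treating a single tier as half of that is an assumption.

```python
def plan_l2_delete(keys, pinned, force=False):
    """Split resolved L2 keys into those to delete and those kept by a pin."""
    if force:
        # Force deletes pinned keys too and releases their pins.
        return {"delete": list(keys), "skipped": 0,
                "unpin": [k for k in keys if k in pinned]}
    return {"delete": [k for k in keys if k not in pinned],
            "skipped": sum(1 for k in keys if k in pinned),
            "unpin": []}


def max_affected(requested, world_size, tier):
    """Upper bound on 'affected' for a delete touching one or both tiers."""
    tiers_touched = 2 if tier == "all" else 1
    return tiers_touched * requested * world_size


print(plan_l2_delete(["a", "b", "c"], pinned={"b"}))
print(plan_l2_delete(["a", "b", "c"], pinned={"b"}, force=True))
print(max_affected(12, 1, "all"))   # matches the example response above
```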

CacheBlend fingerprint directory #

The /blend group is the fleet-wide fingerprint index behind CacheBlend cross-request reuse. Blend-enabled MP servers publish chunk fingerprints on STORE (POST /blend/fingerprints) and query them on LOOKUP (POST /blend/match); the index is chunked at LMCACHE_MP_COORDINATOR_CHUNK_SIZE tokens and rolling-hash-matched at LMCACHE_MP_COORDINATOR_BLEND_PROBE_STRIDE. These endpoints are server-to-coordinator; they are not usually called by hand.

The wire types (StoreRangeModel, BlendFingerprintRequest, BlendMatchRequest, GlobalMatchModel and their responses) live in lmcache/v1/mp_coordinator/schemas.py.

`POST /blend/fingerprints`#

Register stored chunk fingerprints (idempotent).

Request body:

Field	Type	Description
`ranges`	list	Stored token ranges. Each entry carries `model_scope` (the reuse scope, typically the model name), `tokens` (the raw stored token ids), `object_keys` (the shared-L2 storage key per chunk, in order), and `old_st_base` (the token position of the range's first token). The directory chunks `tokens` at the coordinator's chunk size and hashes each chunk; chunk `i` maps to `object_keys[i]`.

Response (200 OK):

{"inserted": 3}

inserted reports how many fingerprints were newly registered (existing entries are left in place; re-publishing is safe).

HTTP status codes:

200: fingerprints processed.
422: The request body failed field-level validation.
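
The chunk-to-key mapping and the idempotent insert can be illustrated with a toy directory. This is a sketch under assumptions, not the coordinator's index: the function names are hypothetical, `hashlib.blake2b` is a standard-library stand-in for the configured hash algorithm, and computing each chunk's stored position as `old_st_base + i * chunk_size` is an inference from the field descriptions.

```python
import hashlib


def chunk_fingerprints(tokens, object_keys, old_st_base, chunk_size=256):
    """Return (fingerprint, object_key, old_st) for each full chunk of a range."""
    entries = []
    for i, key in enumerate(object_keys):
        chunk = tokens[i * chunk_size:(i + 1) * chunk_size]
        if len(chunk) < chunk_size:
            break  # only full chunks are indexed in this sketch
        raw = b"".join(t.to_bytes(4, "little") for t in chunk)
        digest = hashlib.blake2b(raw, digest_size=16).hexdigest()  # stand-in hash
        entries.append((digest, key, old_st_base + i * chunk_size))
    return entries


def publish(directory, model_scope, entries):
    """Insert entries, leaving existing ones in place; return the number added."""
    inserted = 0
    for digest, key, old_st in entries:
        if (model_scope, digest) not in directory:
            directory[(model_scope, digest)] = (key, old_st)
            inserted += 1
    return inserted


directory = {}
entries = chunk_fingerprints(list(range(8)), ["k0", "k1"], old_st_base=0, chunk_size=4)
first = publish(directory, "m", entries)    # both chunks are new
second = publish(directory, "m", entries)   # re-publishing adds nothing
print(first, second)
```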

`DELETE /blend/fingerprints`#

Evict fingerprints by shared-L2 storage key (idempotent). MP servers call this when the underlying L2 objects are dropped, so the directory does not keep handing out stale matches.

Request body:

Field	Type	Description
`object_keys`	list[string]	Storage keys whose fingerprint entries should be removed. Unknown keys are ignored.

Response (200 OK):

{"removed": 2}

removed reports how many entries were actually evicted.

HTTP status codes:

200: keys processed.
422: The request body failed field-level validation.

`POST /blend/match`#

Match a request's token buffer against the directory and return the reusable chunks.

Request body:

Field	Type	Description
`model_scope`	string	Reuse scope to match within (typically the model name).
`tokens_b64`	string	Request tokens packed as base64 little-endian `uint32` (see `encode_tokens` / `decode_tokens` in `schemas.py`). The coordinator hashes them at `LMCACHE_MP_COORDINATOR_CHUNK_SIZE` positions striding by `LMCACHE_MP_COORDINATOR_BLEND_PROBE_STRIDE`.

Response (200 OK):

{
  "matches": [
    {"object_key": "ab12...", "old_st": 0,    "cur_st": 512},
    {"object_key": "cd34...", "old_st": 256,  "cur_st": 768}
  ]
}

Each entry names one reusable chunk: object_key is the shared-L2 key, old_st is its token position in the stored sequence (re-RoPE source), and cur_st is the position in the request (re-RoPE target). Matches are sorted ascending by cur_st. An empty request or an unknown model_scope returns {"matches": []}.

HTTP status codes:

200: match completed (an empty match list is not an error).
422: tokens_b64 is not valid base64 or not a whole number of uint32 tokens.
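
The `tokens_b64` packing (base64 over little-endian `uint32` values) can be reproduced with the standard library. The functions below are a sketch with hypothetical names; they are meant to mirror the documented wire format, not to replace `encode_tokens` / `decode_tokens` in `schemas.py`, whose exact behavior is not shown here.

```python
import base64
import struct


def encode_tokens_sketch(tokens):
    """Pack token ids as little-endian uint32 and base64-encode the bytes."""
    raw = struct.pack(f"<{len(tokens)}I", *tokens)
    return base64.b64encode(raw).decode("ascii")


def decode_tokens_sketch(tokens_b64):
    """Inverse of encode_tokens_sketch; raises ValueError on malformed input."""
    raw = base64.b64decode(tokens_b64, validate=True)
    if len(raw) % 4:
        raise ValueError("not a whole number of uint32 tokens")
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))


packed = encode_tokens_sketch([1, 2, 3])
print(packed)
print(decode_tokens_sketch(packed))
```

The two failure modes checked in `decode_tokens_sketch` correspond to the two 422 conditions listed above: input that is not valid base64, and a byte length that is not a multiple of four.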