HTTP API#
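When the MP server is started through lmcache server (the recommended entry point), it exposes a FastAPI-based HTTP frontend alongside the ZMQ socket used by vLLM. The HTTP API is intended for operators, orchestration tools (such as Kubernetes), and debugging tools; it is not on the inference data path.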
Where the routes come from#
Routes are assembled from three sources, all merged into one FastAPI app by
HTTPAPIRegistry at startup:
- MP-native routes: any module named *_api.py under lmcache/v1/multiprocess/http_apis/ that exposes a module-level router (a fastapi.APIRouter) is auto-discovered. This covers the operational surface: status, cache control, L2 management, quota, and runtime reconfiguration.
- Shared "common" routes: lmcache/v1/multiprocess/http_apis/common_api.py aggregates every compatible router under lmcache/v1/internal_api_server/common/ (skipping any module listed in _MP_INCOMPATIBLE_MODULES, currently empty) and forwards them to the auto-discovery pipeline. These are the cross-server diagnostics shared with the vLLM-embedded API server (/env, /loglevel, /metrics, /threads, /periodic-threads*, /run_script). Adding a new compatible module under internal_api_server/common requires no wiring changes on the MP side.
- Re-exported version routes: lmcache/v1/multiprocess/http_apis/version_api.py re-exports the router from lmcache/v1/internal_api_server/vllm/version_api.py, exposing /version, /lmc_version, and /commit_id.
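Server configuration#
| Parameter | Default | Description |
|---|---|---|
| --http-host | | Host to bind the HTTP server to. |
| --http-port | | Port to bind the HTTP server to. |
Example: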
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--http-host 0.0.0.0 --http-port 8080
All examples below assume the server is reachable at http://localhost:8080.
Endpoint Overview#
The routes are grouped by purpose below. The operational surface (health,
status, cache and storage control) lives at top-level paths; routes inherited
from the shared internal_api_server package keep their original paths for
compatibility with the vLLM-embedded API server.
Note
Several handlers report failure in the response body rather than via a
non-200 status code (e.g. DELETE /l2 returns 200 with ok=false,
and /periodic-threads-health returns 200 with healthy=false).
The error-field name is also not uniform: /healthcheck and
/clear-cache use reason on failure, while /status, /conf,
and /kvcache/check use error. Per-endpoint details below are
authoritative.
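Because failure signals are spread across the status code and several body fields, a client may want a single helper that normalizes them. The following is a minimal sketch, assuming the response body has already been parsed into a dict. The helper name failure_reason is hypothetical, and it checks only the signals named in the note above.

```python
from typing import Optional


def failure_reason(status_code: int, body: dict) -> Optional[str]:
    """Return a human-readable failure reason, or None if the call succeeded.

    Hypothetical client-side helper: it only looks at the signals described
    in the note above (status code, ok/healthy flags, reason/error fields).
    """
    # The error text lives under "reason" on some routes and "error" on others.
    detail = body.get("reason") or body.get("error")
    if status_code >= 400:
        return detail or f"HTTP {status_code}"
    # Some handlers answer 200 and flag the failure in the body instead.
    if body.get("ok") is False or body.get("healthy") is False:
        return detail or "reported failure in response body"
    return None
```

Routes with their own conventions may need extra cases; the per-endpoint sections remain the reference.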
Liveness and health
| Method | Path | Purpose |
|---|---|---|
| GET | / | Static liveness ping (does not touch the engine). |
| GET | /healthcheck | K8s liveness/readiness probe. |
Inspection and status
| Method | Path | Purpose |
|---|---|---|
| GET | /status | Detailed engine snapshot (L1, L2, registered contexts, sessions, prefetch jobs) for inspection and debugging. |
| GET | /conf | Dump the merged server configuration objects. |
| GET | /version | Combined version string. |
| GET | /lmc_version | LMCache package version string. |
| GET | /commit_id | Build commit id. |
Cache control
| Method | Path | Purpose |
|---|---|---|
| POST | /clear-cache | Force-clear all KV data held in L1 (CPU) memory. |
| GET | /kvcache/check | Compute MD5 checksums over the engine KV cache for a set of block IDs (diagnostics / round-trip integrity checks). |
L2 storage management
| Method | Path | Purpose |
|---|---|---|
| GET | /l2/adapters | Enumerate every configured L2 adapter. |
| DELETE | /l2 | Delete a caller-supplied list of keys from one L2 adapter (default: primary; override with ?adapter=<type_name>). |
| GET | /l2/keys | Paginate keys currently resident in one L2 adapter (optionally filtered by model_name). |
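Quota management
| Method | Path | Purpose |
|---|---|---|
| GET | /quota | List every registered quota. |
| PUT | /quota/{cache_salt} | Set or update the quota for a cache_salt. |
| GET | /quota/{cache_salt} | Read the quota for a single cache_salt. |
| DELETE | /quota/{cache_salt} | Remove the quota for a cache_salt. |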
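Runtime L2 reconfiguration
| Method | Path | Purpose |
|---|---|---|
| GET | /reconfigure/backends | List backend strings accepted by the reconfiguration routes. |
| GET | /reconfigure/{backend}/status | List the L2 adapters of the given backend type that can be managed at runtime. |
| POST | /reconfigure/{backend}/{operation} | Apply one runtime reconfiguration operation to a backend adapter. |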
Observability
| Method | Path | Purpose |
|---|---|---|
| GET | /metrics | Prometheus exposition format. |
| POST | /metrics/reset | Reset all observability metrics to their initial state. |
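Diagnostics and debugging
| Method | Path | Purpose |
|---|---|---|
| GET | /loglevel | List, inspect, or set logger levels. |
| GET | /threads | List active Python threads with their stack traces. |
| GET | /periodic-threads | List registered periodic threads with summary counts. |
| GET | /periodic-threads/{thread_name} | Detailed status for a single periodic thread. |
| GET | /periodic-threads-health | Fast health check over critical and high level periodic threads. |
| GET | /env | Dump process environment variables (JSON body, served as text/plain). |
| POST | /run_script | Execute an uploaded Python script in a restricted sandbox. |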
Liveness and Health#
GET /#
Basic liveness check. Returns a static payload indicating the HTTP server
is running; it does not touch the cache engine. Use /healthcheck
instead for probes that also verify the engine is initialized.
Response (200 OK):
{
"status": "ok",
"service": "LMCache HTTP API"
}
Example:
curl -s http://localhost:8080/
GET /healthcheck#
Health check endpoint suitable for Kubernetes liveness and readiness
probes. A 200 response means the HTTP server is alive and the MP
cache engine object is wired onto app.state. A 503 response
indicates the engine is not yet present (still initializing, or failed to
initialize). The check verifies that the engine attribute is set; it does
not call into the engine to assert deeper liveness.
Response (200 OK):
{
"status": "healthy"
}
Response (503 Service Unavailable):
{
"status": "unhealthy",
"reason": "engine not initialized"
}
Example:
curl -s http://localhost:8080/healthcheck
Kubernetes probe snippet:
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
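Outside Kubernetes, a deployment script can apply the same 200-versus-503 rule by polling /healthcheck until the engine is wired up. The sketch below uses only the standard library. The function names, the retry budget, and the choice to treat a refused connection as "not ready" are assumptions made for illustration.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def probe_healthcheck(base_url: str = "http://localhost:8080") -> Tuple[int, dict]:
    """GET /healthcheck and return (status_code, parsed_body)."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthcheck", timeout=5) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as exc:
        # urllib raises on 503; the exception still carries the JSON body.
        return exc.code, json.loads(exc.read())
    except urllib.error.URLError:
        # Server not listening yet: report a pseudo-status of 0.
        return 0, {}


def wait_until_healthy(
    probe: Callable[[], Tuple[int, dict]] = probe_healthcheck,
    attempts: int = 30,
    delay_s: float = 1.0,
) -> bool:
    """Poll until the probe returns 200 or the attempts run out."""
    for attempt in range(attempts):
        status, _body = probe()
        if status == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Passing the probe as a callable keeps the retry loop testable without a running server.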
Inspection and Status#
GET /status#
Returns a detailed snapshot of the MP engine's internal state. The payload is
assembled by MPCacheServer.report_status(): a fixed set of engine-level
fields, the full storage-manager status, plus whatever keys each loaded module
contributes (so the exact key set depends on which modules are active —
registered_gpu_ids / cache_context_meta come from the transfer module,
active_prefetch_jobs from the lookup module, and blend modes add their own
fields). Intended for operators and debugging, not for monitoring (use
Prometheus metrics for time-series data; see Observability).
Response (200 OK):
{
"is_healthy": true,
"engine_type": "MPCacheServer",
"chunk_size": 256,
"hash_algorithm": "builtin-hash",
"active_sessions": 2,
"registered_gpu_ids": [0, 1],
"cache_context_meta": {
"0": {
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"world_size": 1,
"kv_cache_layout": {
"num_layers": 32,
"num_blocks": 12345,
"cache_size_per_token": 131072,
"kernel_groups": [
{
"kernel_group_idx": 0,
"engine_group_idx": 0,
"object_group_idx": 0,
"num_layers": 32,
"layer_indices": [0, 1, "..."],
"tokens_per_block": 16,
"slots_per_block": 16,
"dtype": "torch.bfloat16",
"engine_kv_concrete_shape": "...",
"is_mla": false,
"engine_kv_format": "...",
"engine_kv_shape": "...",
"attention_backend": "..."
}
]
}
}
},
"active_prefetch_jobs": 0,
"storage_manager": {
"is_healthy": true,
"...": "backend-specific fields"
}
}
Response (503 Service Unavailable) when the engine is not yet initialized:
{
"error": "engine not initialized"
}
Example:
curl -s http://localhost:8080/status | jq
GET /conf#
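Returns every server-side configuration object registered on app.state.configs (typically mp, storage_manager, and observability) as a single indented JSON document. Dataclasses are serialized with safe_asdict; other values go through make_json_safe. Use it to confirm the configuration the process actually loaded (including environment-variable overrides) without restarting the server.
Response (200 OK):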
{
"mp": {
"http_host": "0.0.0.0",
"http_port": 8080,
"...": "..."
},
"storage_manager": {
"...": "..."
},
"observability": {
"...": "..."
}
}
Response (503 Service Unavailable) when the configs have not yet been attached to app.state:
{
"error": "configs not initialized"
}
Example:
curl -s http://localhost:8080/conf | jq
GET /version, GET /lmc_version, GET /commit_id#
Version descriptors. Each returns a bare JSON string (not an object):
- GET /version: the combined descriptor from lmcache.utils.get_version(), formatted "<version>-<commit_id>" (e.g. "0.3.1-ca79ea33"). On a source checkout without build-time version metadata, each missing component falls back to the literal "NA" (so a metadata-less build returns "NA-NA").
- GET /lmc_version: the raw package version string (lmcache.utils.VERSION); empty string "" when the generated lmcache._version module is absent.
- GET /commit_id: the git commit id baked into the build (lmcache.utils.COMMIT_ID); empty string "" when unavailable.
All three are unconditional 200 OK.
Example:
curl -s http://localhost:8080/version
curl -s http://localhost:8080/lmc_version
curl -s http://localhost:8080/commit_id
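Since the body is a bare JSON string, a client has to decode it before splitting the combined descriptor. A small sketch, assuming the commit id never contains a hyphen (so the split happens at the last hyphen) and treating the literal "NA" as missing; both function names are hypothetical.

```python
import json
from typing import Optional, Tuple


def split_version(descriptor: str) -> Tuple[Optional[str], Optional[str]]:
    """Split "<version>-<commit_id>" into (version, commit_id).

    Components reported as the literal "NA" (or empty) become None.
    """
    version, _sep, commit_id = descriptor.rpartition("-")
    return (
        None if version in ("", "NA") else version,
        None if commit_id in ("", "NA") else commit_id,
    )


def parse_version_body(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Decode the bare JSON string returned by GET /version, then split it."""
    return split_version(json.loads(raw))
```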
Cache Control#
POST /clear-cache#
Force-clears all KV cache data currently held in L1 (CPU) memory
(delegates to the ManagementModule).
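Warning
This endpoint is destructive and bypasses the read/write locks. In-flight store or prefetch operations may be corrupted as a result. Use it only when the server is idle, or to recover from a cache state that is known to be bad.
The request body is ignored.
Response (200 OK):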
{
"status": "ok"
}
Response (503 Service Unavailable):
{
"status": "error",
"reason": "engine not initialized"
}
Example:
curl -s -X POST http://localhost:8080/clear-cache
GET /kvcache/check#
Compute MD5 checksums over the engine KV cache, grouped chunk_size blocks
per hashed chunk. MP mode addresses KV storage by block IDs natively (the
same units used by STORE / RETRIEVE), so the endpoint is fully
block-centric: block_ids enumerates the target blocks and
chunk_size counts blocks per chunk. Intended for diagnostics and
round-trip integrity checks from lmcache bench server — not for the
inference data path.
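Query parameters:
| Name | Required | Description |
|---|---|---|
| block_ids | Yes | Engine block IDs in mixed format. |
| chunk_size | Yes | Positive integer: the number of blocks per hashed chunk. |
| instance_id | No | Registered KV context ID on the engine. |
| layerwise | No | If true, report checksums per layer. |
Response (200 OK):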
{
"status": "success",
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": ["<md5>", "<md5>"],
"layerwise": false,
"block_id_ranges": "0,[2,5],8"
}
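When layerwise=true, chunk_checksums is a dict keyed by "layer_<idx>" whose values are per-layer lists.
HTTP status codes:
- 200: success.
- 400: block_ids is missing or malformed, or chunk_size is missing or non-positive.
- 404: instance_id is not registered, or the registered KV tensors are empty.
- 501: the engine has no cache_contexts, or the KV format is not supported by this endpoint (page-buffer-fused and cross-layer layouts are rejected until they are actually needed).
- 503: the engine is not yet initialized on app.state.
Example: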
curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2"
curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2&layerwise=true"
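For a round-trip integrity check, the chunk_checksums from two calls (for example before and after a store/retrieve cycle) can be compared chunk by chunk. The sketch below handles both response shapes. The function name and the "all" label used for the flat form are illustrative choices, not part of the API.

```python
from typing import Dict, List, Union

Checksums = Union[List[str], Dict[str, List[str]]]


def diff_checksums(before: Checksums, after: Checksums) -> List[str]:
    """Return labels of chunks whose checksum differs between two responses.

    Accepts the chunk_checksums value in either shape: a flat list
    (layerwise=false) or a dict keyed by "layer_<idx>" (layerwise=true).
    """
    if isinstance(before, dict) != isinstance(after, dict):
        raise ValueError("cannot compare layerwise and non-layerwise results")
    if not isinstance(before, dict):
        # Wrap the flat form so both shapes share one code path.
        before, after = {"all": before}, {"all": after}
    mismatches = []
    for layer in sorted(set(before) | set(after)):
        old, new = before.get(layer, []), after.get(layer, [])
        for idx in range(max(len(old), len(new))):
            a = old[idx] if idx < len(old) else None
            b = new[idx] if idx < len(new) else None
            if a != b:
                mismatches.append(f"{layer}[{idx}]")
    return mismatches
```

An empty result means every chunk matched; a missing chunk on either side counts as a mismatch.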
L2 Storage Management#
Three endpoints — GET /l2/adapters, DELETE /l2, and
GET /l2/keys — let operators enumerate the configured L2
backends, purge keys from one, and enumerate what is currently
resident.
DELETE /l2 and GET /l2/keys accept an optional
?adapter=<type_name> query parameter to target a specific adapter.
Omit the selector to target the primary (first-configured)
adapter — the v1 behavior, preserved for clients that don't care
about multi-adapter deployments. When multiple adapters share a
type_name, the first match wins. Use GET /l2/adapters to learn
the valid selectors.
All three are intended for operator / admin workflows ("purge this user's keys", "show me what's resident", "garbage-collect orphans after a rename"). They are not on the inference data path.
L1 is intentionally not touched. Keys deleted from L2 may still return from L1 until the L1 eviction controller expires them naturally; callers that need an L1+L2 purge should layer their own L1 invalidation or wait for natural L1 eviction.
The coordinator's eviction loop uses DELETE /l2 automatically (see
Multi-server coordination, "L2 usage tracking and eviction"); the
GET /l2/keys endpoint also powers the coordinator's startup
resync. Manual curl usage is reserved for ad-hoc operator
actions and debugging.
For full request/response semantics, pagination, error codes, and the
event flow back to the coordinator, see the design doc at
docs/design/v1/multiprocess/l2_apis.md.
GET /l2/adapters#
Enumerate every L2 adapter the engine has loaded, in configuration order.
Response (200 OK):
{
"adapters": [
{"index": 0, "type_name": "S3L2Adapter", "primary": true},
{"index": 1, "type_name": "FSL2Adapter", "primary": false}
]
}
primary is true only on the first entry. An engine that has
no L2 backends returns {"adapters": []} (still 200 — the
engine is initialized, it just has no L2 storage).
HTTP status codes:
- 200: success (including the no-adapters case).
- 503: engine not initialized.
Example:
curl -s http://localhost:8080/l2/adapters | jq
DELETE /l2#
Delete a caller-supplied list of keys from one L2 adapter. Idempotent: keys absent from the adapter are skipped silently; keys currently locked by in-flight store/load tasks are skipped so the delete never corrupts an active transfer. The blocking adapter call is run off the event loop.
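Query parameters:
| Name | Default | Description |
|---|---|---|
| adapter | primary | type_name of the adapter to target; omit it to target the primary adapter. |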
Per-key successful deletions fire on_l2_keys_deleted on the
adapter's listeners — when the coordinator is wired (see
--coordinator-l2-event-reporting), the deletions show up at the
coordinator's POST /l2/events as "type": "delete" events. The
coordinator's eviction + usage trackers learn about the deletion from
that event flow, not from the response of this call.
Body: {"keys": [EncodedObjectKey, ...]} where each
EncodedObjectKey is
{
"chunk_hash_hex": "abc123...",
"model_name": "meta-llama/Llama-3-8B",
"kv_rank": 0,
"object_group_id": 0,
"cache_salt": "user-a"
}
object_group_id (default 0) and cache_salt (default "")
are optional for backward compatibility with older wire payloads. The
batch is capped at 10000 keys per request.
Response (200 OK):
{
"requested": 2,
"adapter": "S3L2Adapter",
"ok": true
}
On adapter-level failure the response is still 200 with
ok=false and an error field carrying the reason.
HTTP status codes:
- 200: request reached the adapter (check ok for outcome).
- 400: batch exceeds the limit, or a key payload violates an ObjectKey invariant (bad hex, @ in model_name, forbidden cache_salt character).
- 404: ?adapter=<name> does not match any configured adapter.
- 422: Pydantic-level body-shape failure (missing keys, wrong field types).
- 503: engine not initialized, or no L2 adapters configured.
Example:
curl -s -X DELETE http://localhost:8080/l2 \
-H 'Content-Type: application/json' \
-d '{
"keys": [
{"chunk_hash_hex": "aa", "model_name": "m",
"kv_rank": 0, "object_group_id": 0, "cache_salt": "user-a"}
]
}'
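A purge that involves more keys than the per-request cap has to be split into several DELETE /l2 calls. The sketch below only builds the JSON bodies; sending them is left to the caller. The function name is hypothetical, and the constant mirrors the 10000-key cap stated above.

```python
import json
from typing import Dict, Iterator, List

MAX_KEYS_PER_REQUEST = 10000  # per-request cap stated above


def delete_bodies(keys: List[Dict], limit: int = MAX_KEYS_PER_REQUEST) -> Iterator[str]:
    """Yield JSON bodies for DELETE /l2, each holding at most `limit` keys."""
    if limit < 1:
        raise ValueError("limit must be positive")
    for start in range(0, len(keys), limit):
        yield json.dumps({"keys": keys[start:start + limit]})
```

Because the endpoint is idempotent, a failed batch can be retried without tracking which keys were already removed.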
GET /l2/keys#
Paginate keys currently resident in one L2 adapter.
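Query parameters:
| Name | Default | Description |
|---|---|---|
| adapter | primary | type_name of the adapter to target; omit it to target the primary adapter. |
| model_name | none | Restrict the result to keys whose model_name matches. |
| page_size | | Max entries per page. Must be in [1, 5000]. |
| page_token | none | Opaque cursor from the previous page's next_page_token. |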
The page token is private to the adapter; do not parse or modify it.
Adapters that support listing (currently only the S3 adapter via
ListObjectsV2) guarantee best-effort consistency, not snapshot
isolation — concurrent stores or deletes during a paginated walk may
cause keys to appear, disappear, or shift between pages.
Response (200 OK):
{
"adapter": "S3L2Adapter",
"entries": [
{
"key": {
"chunk_hash_hex": "abc123",
"model_name": "meta-llama/Llama-3-8B",
"kv_rank": 0,
"object_group_id": 0,
"cache_salt": "user-a"
},
"size_bytes": 4194304
}
],
"next_page_token": "opaque-cursor-string"
}
next_page_token is null when the listing is exhausted.
HTTP status codes:
- 200: success.
- 400: malformed page_token (adapter-level).
- 404: ?adapter=<name> does not match any configured adapter.
- 422: page_size outside [1, 5000].
- 501: selected adapter does not implement listing. In v1 only S3L2Adapter does; adapters wrapped by SerdeL2AdapterWrapper inherit the wrapped adapter's behavior.
- 503: engine not initialized, or no L2 adapters configured.
Example: paginate every key for a model.
next=""
while :; do
page=$(curl -s "http://localhost:8080/l2/keys?model_name=meta-llama/Llama-3-8B&page_size=500&page_token=$next")
echo "$page" | jq '.entries[]'
next=$(echo "$page" | jq -r '.next_page_token // empty')
[ -z "$next" ] && break
done
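The same walk can be written in Python, which makes it easier to URL-encode the filters and to leave page_token out of the first request. This is a sketch under stated assumptions: keys_url and iter_entries are hypothetical names, and fetch_page is any caller-supplied function that performs the GET for a given token and returns the parsed JSON page.

```python
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode


def keys_url(base_url: str, page_token: Optional[str] = None, **filters: object) -> str:
    """Build a GET /l2/keys URL, omitting page_token on the first page."""
    params = dict(filters)
    if page_token:
        params["page_token"] = page_token
    return f"{base_url}/l2/keys?{urlencode(params)}"


def iter_entries(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[Dict]:
    """Yield every entry, following next_page_token until it is null."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page["entries"]
        token = page.get("next_page_token")
        if not token:
            break
```

The token is passed through untouched, in line with the rule that it is private to the adapter.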
Quota Management#
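These endpoints manage the per-cache_salt storage budgets consumed by the IsolatedLRU eviction policy (selected with --eviction-policy IsolatedLRU). Quotas are soft: setting a limit does not reject writes; any cache_salt over its budget is evicted on the next eviction cycle (about 1 second). A cache_salt with no registered quota has an effective limit of 0 bytes, so its data is purged on the next cycle (allowlist semantics).
On engines that were not started with --eviction-policy IsolatedLRU, these endpoints are no-ops: the QuotaManager still exists, but the LRU policy ignores the registered quotas.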
URL escaping for the empty salt. cache_salt="" (un-salted /
anonymous traffic) cannot appear in a URL path parameter, so the API
accepts the sentinel _default in its place. PUT /quota/_default
sets the quota for cache_salt="", and _default is echoed back in
responses for the empty salt. A user that legitimately stores data with
cache_salt="_default" cannot be managed via this HTTP API distinctly
from anonymous traffic — both map to the same path parameter; pick any
other value (e.g. "default") to disambiguate.
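A client library can hide the sentinel by mapping the empty salt in both directions. A minimal sketch with hypothetical helper names; percent-encoding the salt so that it stays a single path segment is an assumption about what the server accepts, not documented behavior.

```python
from urllib.parse import quote

EMPTY_SALT_SENTINEL = "_default"  # sentinel described above


def quota_path(cache_salt: str) -> str:
    """Return the /quota/{cache_salt} path, mapping "" to the sentinel."""
    segment = cache_salt if cache_salt else EMPTY_SALT_SENTINEL
    # Percent-encode everything, including "/", so the salt stays one segment.
    return "/quota/" + quote(segment, safe="")


def salt_from_response(name: str) -> str:
    """Map a cache_salt echoed by the API back to the stored salt."""
    return "" if name == EMPTY_SALT_SENTINEL else name
```

Note that salt_from_response cannot tell a real "_default" salt from anonymous traffic, which is the ambiguity described above.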
PUT /quota/{cache_salt}#
Creates or updates a quota.
Body: {"limit_gb": <float>} (required, finite, non-negative).
Response (200 OK):
{"cache_salt": "alice", "limit_gb": 10.0, "status": "ok"}
Errors: 400 for malformed JSON, a missing limit_gb, a non-numeric limit_gb, nan / inf, or a negative value; 503 when the engine is not initialized.
Example:
curl -s -X PUT http://localhost:8080/quota/alice \
-H 'Content-Type: application/json' \
-d '{"limit_gb": 10.0}'
GET /quota/{cache_salt}#
Reads the current quota and live usage for one cache_salt.
Response (200 OK):
{
"cache_salt": "alice",
"limit_gb": 10.0,
"current_usage_gb": 2.137,
"exists": true
}
exists is false when no quota was ever registered for this
cache_salt (limit_gb is then 0.0 and current_usage_gb
reflects whatever bytes are currently cached for that salt — those bytes
will evict next cycle under IsolatedLRU). This endpoint never returns
404 for an unknown salt.
DELETE /quota/{cache_salt}#
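Removes the quota entry for a cache_salt. Any bytes still cached under that cache_salt go over budget on the next eviction cycle (the effective limit drops to 0) and are evicted.
Response (200 OK):
{"cache_salt": "alice", "status": "removed"}
When no quota is registered for the given cache_salt, the response is {"cache_salt": "...", "status": "not_found"} (still 200 OK).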
GET /quota#
Lists every registered quota with its live usage.
Response (200 OK):
{
"users": {
"alice": {"limit_gb": 10.0, "current_usage_gb": 2.137},
"bob": {"limit_gb": 4.0, "current_usage_gb": 0.812}
}
}
Only cache_salt values with a registered quota appear; the empty
salt is reported under the _default key.
Runtime L2 Reconfiguration#
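These endpoints are available when the server has L2 adapters that can be reconfigured at runtime. They change only LMCache's runtime mappings and metadata; backend resources such as DAX device paths must already exist and must be readable and writable by the server. The endpoint routes backend, operation, and the JSON request body to the generic L2 adapter reconfiguration API, while backend-specific validation and migration semantics stay inside the adapter.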
backend and operation path segments are normalized (stripped and
lower-cased). Within a request body, adapter_index (default 0) is
backend-local — it indexes only the adapters of that backend, not the
engine-wide adapter list. If an L2 adapter is wrapped by serde, the backend
string is still the configured L2 adapter type, not the serde wrapper type.
GET /reconfigure/backends#
List the backend strings that can be used in
/reconfigure/{backend}/status and
/reconfigure/{backend}/{operation}.
Response (200 OK):
{
"enabled": true,
"num_backends": 1,
"backends": ["dax"]
}
enabled is false (and backends empty) when no reconfigurable
adapter is present.
HTTP status codes: 200 on success; 503 if the engine is not
initialized.
GET /reconfigure/{backend}/status#
Report the runtime-manageable adapters for one backend type. Each adapter
entry's adapter_index is rewritten to its backend-local 0-based index
(the value to pass back in operation request bodies).
Response (200 OK):
{
"enabled": true,
"backend": "dax",
"num_adapters": 1,
"adapters": [
{"adapter_index": 0, "...": "backend-specific adapter fields"}
]
}
An unknown or empty backend returns enabled=false, num_adapters=0,
adapters=[] (it is not a 404).
HTTP status codes: 200 on success; 400 if backend is empty;
503 if the engine is not initialized.
POST /reconfigure/{backend}/{operation}#
Apply one reconfiguration operation to a backend adapter. The request body is
a JSON object whose accepted fields depend on the backend and operation. The
200 response is whatever the storage manager's
reconfigure_l2_adapter returns (a backend-defined dict).
For the generic path (any backend other than dax), the body carries
adapter_index plus any backend-specific fields, which are forwarded
verbatim to the adapter.
For Device-DAX (backend=dax), JSON request bodies are used because DAX
paths contain slashes. The accepted operations and fields are:
Operation |
Body fields |
|---|---|
|
|
|
|
|
|
size accepts an integer byte count or a string with a base-1024 unit
suffix (b, kib, mib, gib, tib and the k/m/g/t
aliases), e.g. "100GiB"; it must resolve to a positive value.
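To validate a size before sending it, a client can mirror the rule just described. The sketch below is not the server's parser. It assumes whole-number quantities, case-insensitive suffixes, and that a bare numeric string means bytes; the function name is hypothetical.

```python
import re
from typing import Union

# Base-1024 multipliers for the suffixes listed above.
_UNITS = {
    "b": 1,
    "k": 1024, "kib": 1024,
    "m": 1024**2, "mib": 1024**2,
    "g": 1024**3, "gib": 1024**3,
    "t": 1024**4, "tib": 1024**4,
}
_SIZE_RE = re.compile(r"\s*(\d+)\s*([a-z]*)\s*", re.IGNORECASE)


def size_to_bytes(size: Union[int, str]) -> int:
    """Resolve an integer byte count or a suffixed string to positive bytes."""
    if isinstance(size, bool) or not isinstance(size, (int, str)):
        raise ValueError(f"unsupported size type: {size!r}")
    if isinstance(size, int):
        value = size
    else:
        match = _SIZE_RE.fullmatch(size)
        if match is None:
            raise ValueError(f"malformed size: {size!r}")
        number, unit = match.groups()
        unit = unit.lower() or "b"  # assumption: no suffix means bytes
        if unit not in _UNITS:
            raise ValueError(f"unknown unit in size: {size!r}")
        value = int(number) * _UNITS[unit]
    if value <= 0:
        raise ValueError("size must resolve to a positive value")
    return value
```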
HTTP status codes:
- 200: success (body is the storage manager's reconfigure result).
- 400: empty backend / operation, an unsupported DAX operation, or an invalid size.
- 404: adapter_index is out of range for the backend.
- 422: request body fails validation (e.g. a missing required field, or an unknown field in a DAX body; DAX bodies reject extras).
- 503: engine not initialized.
For detailed request examples, mode semantics, and validation guidance, see Device-DAX (/dev/dax).
Observability#
GET /metrics#
Prometheus exposition format for every metric registered on the default
prometheus_client registry (Content-Type: text/plain). Scrape this
directly from Prometheus. See Observability for the list of
exported metrics.
Example:
curl -s http://localhost:8080/metrics
POST /metrics/reset#
Resets all LMCache observability metrics to their initial state (reset_observability_metrics). Intended for test harnesses and benchmarks, not for production use.
Response (200 OK, text/plain):
ok
Example:
curl -s -X POST http://localhost:8080/metrics/reset
Diagnostics and Debugging#
GET /loglevel#
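Inspect or change Python logger levels at runtime. All responses are text/plain. The endpoint has three modes, selected by query parameters:
| Query | Behavior |
|---|---|
| (no parameters) | List all loggers. |
| logger_name=<name> | Return the effective level of the named logger. |
| logger_name=<name>&level=<LEVEL> | Set the named logger (and its handlers) to the given level. |
Passing level without logger_name matches none of the modes and
returns 200 with a null body.
Example: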
# list everything
curl -s http://localhost:8080/loglevel
# read one
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache'
# elevate to DEBUG
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache&level=DEBUG'
GET /threads#
Enumerate active Python threads in the server process along with their
stack traces, plus a total-count summary (Content-Type: text/plain).
Useful for live debugging of hangs or runaway workers.
| Query | Behavior |
|---|---|
| name=<substring> | Keep only threads whose name contains the given substring. |
| | Keep only … |
Warning
The response contains live stack traces and can disclose internal code paths and state. Restrict network access to this endpoint in production.
Example:
curl -s 'http://localhost:8080/threads?name=periodic'
GET /periodic-threads#
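Returns a JSON snapshot of the PeriodicThreadRegistry: per-level counts plus the state of each thread (last-run timestamp, latest summary, and so on).
| Query | Behavior |
|---|---|
| level=<level> | Include only threads at the given level. For an unknown level, returns … |
| | Include only threads that are currently running. |
| | Include only threads considered active (ticked recently). |
Response (200 OK):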
{
"summary": {
"total_count": 4,
"running_count": 4,
"active_count": 4,
"by_level": {
"critical": {"total": 1, "running": 1, "active": 1},
"high": {"total": 2, "running": 2, "active": 2},
"medium": {"total": 1, "running": 1, "active": 1},
"low": {"total": 0, "running": 0, "active": 0}
}
},
"threads": [
{
"name": "...",
"level": "high",
"interval": 5.0,
"is_running": true,
"is_active": true,
"last_run_ago": 1.2,
"total_runs": 120,
"failed_runs": 0,
"success_rate": 100.0,
"last_summary": {"...": "..."}
}
]
}
Example:
curl -s 'http://localhost:8080/periodic-threads?level=critical' | jq
GET /periodic-threads/{thread_name}#
Detailed status for a single periodic thread (the same per-thread object
shown in the threads list above). Returns 404 with
{"error": "Thread not found: <name>"} if the name is unknown.
Example:
curl -s http://localhost:8080/periodic-threads/storage-flush | jq
GET /periodic-threads-health#
Fast health check covering only critical and high level periodic
threads. A thread is flagged unhealthy when it is marked running but has
not ticked within its expected interval. Always returns 200 — health
is conveyed by the healthy boolean, not the HTTP status.
Response (200 OK):
{
"healthy": true,
"unhealthy_count": 0,
"unhealthy_threads": []
}
When a thread is lagging:
{
"healthy": false,
"unhealthy_count": 1,
"unhealthy_threads": [
{
"name": "storage-flush",
"level": "critical",
"last_run_ago": 42.5,
"interval": 5.0
}
]
}
Example:
curl -s http://localhost:8080/periodic-threads-health
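Since the HTTP status is always 200, an alerting script has to read the body. The sketch below ranks unhealthy threads by how many intervals they appear to have missed. The function name and the "missed intervals" ratio are illustrative, and the code assumes each interval is positive.

```python
from typing import Dict, List, Tuple


def lagging_threads(health: Dict) -> List[Tuple[str, float]]:
    """Return (name, missed_intervals) for each unhealthy thread, worst first.

    missed_intervals is last_run_ago divided by interval, i.e. roughly how
    many ticks the thread has skipped.
    """
    if health.get("healthy", False):
        return []
    lags = [
        (t["name"], t["last_run_ago"] / t["interval"])
        for t in health.get("unhealthy_threads", [])
    ]
    return sorted(lags, key=lambda item: item[1], reverse=True)
```

For the lagging response shown above, this yields a single entry for storage-flush with a ratio of 8.5.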
GET /env#
Dumps the process environment variables as a sorted, pretty-printed
JSON document. The response Content-Type is text/plain so it can be
piped directly to a terminal.
Warning
The payload contains every environment variable, including any secrets injected via the environment. There is no redaction or auth, so restrict network access to this endpoint in production.
Example:
curl -s http://localhost:8080/env
POST /run_script#
Execute an uploaded Python script inside the server process. The script is
uploaded as multipart form data under the field name script and is
exec'd with a restricted __builtins__ (only print, str,
int, float, list, dict, tuple, set, and a guarded
__import__). Only modules listed in --script-allowed-imports can be
imported; the running FastAPI app is injected into the script globals.
If the script assigns a variable named result, its stringified value is
returned; otherwise the body is "Script executed successfully"
(Content-Type: text/plain).
Danger
This endpoint runs caller-supplied code in-process. The restricted
builtins are not a security sandbox — combined with the injected
app object and any allowed imports, treat it as full remote code
execution. Never expose it on an untrusted network.
HTTP status codes:
- 200: script executed.
- 400: no script file provided.
- 500: an exception was raised during import setup or execution (body: "Error executing script: <reason>").
Example:
curl -s -X POST http://localhost:8080/run_script \
-F 'script=@my_script.py'
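Because most builtins are unavailable to uploaded scripts, it can help to dry-run a script locally first. The sketch below is an approximation of the rules described above, not the server's implementation. It assumes the app is injected under the global name app, leaves out the guarded __import__ (so any import fails in the dry run), and uses a stand-in object in place of the FastAPI app.

```python
from types import SimpleNamespace

# Script body to upload. It uses only str, a list comprehension, and the
# injected app object, because builtins such as len or sorted are missing.
SCRIPT = """
paths = [str(route.path) for route in app.routes]
result = ", ".join(paths)
"""

# Builtins the section above lists as available (without __import__).
ALLOWED_BUILTINS = {
    "print": print, "str": str, "int": int, "float": float,
    "list": list, "dict": dict, "tuple": tuple, "set": set,
}


def dry_run(script: str, app: object) -> str:
    """Run a script locally under an approximation of the server's rules."""
    scope = {"__builtins__": dict(ALLOWED_BUILTINS), "app": app}
    exec(script, scope)
    if "result" in scope:
        return str(scope["result"])
    return "Script executed successfully"


# Stand-in for the FastAPI app so the script can be checked without a server.
fake_app = SimpleNamespace(
    routes=[SimpleNamespace(path="/"), SimpleNamespace(path="/healthcheck")]
)
print(dry_run(SCRIPT, fake_app))  # prints "/, /healthcheck"
```

A script that passes the dry run can still fail on the server, for example if a route object there lacks a path attribute.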
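Adding new endpoints#
Endpoints are auto-discovered from
lmcache/v1/multiprocess/http_apis/. To add a new MP-only endpoint:
1. Create a new module named <name>_api.py in that directory.
2. Define a module-level router = APIRouter().
3. Register handlers on router using FastAPI decorators.
4. Access the engine via request.app.state.engine and handle the None case (engine not yet initialized).
HTTPAPIRegistry loads the module automatically at startup; there is no central registration list to edit.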
If the route is generic enough to be shared with the vLLM-embedded API
server, add it under lmcache/v1/internal_api_server/common/ instead.
It will be picked up on the MP side via common_api.py unless its
module name is listed in _MP_INCOMPATIBLE_MODULES there (reserved
for modules that require vLLM-specific app.state attributes; the
list is currently empty). A handler that lives under
internal_api_server/vllm/ can still be surfaced on the MP server by
adding a thin re-export shim under http_apis/ (as
version_api.py does for the version endpoints).
When you add a new endpoint, add a matching section to this page that describes its purpose, its request/response structure, and an example curl call.