HTTP API#

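When the MP server is started via lmcache server (the recommended entry point), it exposes a FastAPI-based HTTP frontend alongside the ZMQ socket used by vLLM. This HTTP API is intended for operators, orchestration tools (for example, Kubernetes), and debugging tools; it is not on the inference data path.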

Where the routes come from#

Routes are assembled from three sources, all merged into one FastAPI app by HTTPAPIRegistry at startup:

  • MP-native routes: any module named *_api.py under lmcache/v1/multiprocess/http_apis/ that exposes a module-level router (a fastapi.APIRouter) is auto-discovered. This covers the operational surface: status, cache control, L2 management, quota, and runtime reconfiguration.

  • Shared "common" routes: lmcache/v1/multiprocess/http_apis/common_api.py aggregates every compatible router under lmcache/v1/internal_api_server/common/ (skipping any module listed in _MP_INCOMPATIBLE_MODULES, currently empty) and forwards them to the auto-discovery pipeline. These are the cross-server diagnostics shared with the vLLM-embedded API server (/env, /loglevel, /metrics, /threads, /periodic-threads*, /run_script). Adding a new compatible module under internal_api_server/common requires no wiring changes on the MP side.

  • Re-exported version routes: lmcache/v1/multiprocess/http_apis/version_api.py re-exports the router from lmcache/v1/internal_api_server/vllm/version_api.py, exposing /version, /lmc_version, and /commit_id.

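Server Configuration#

| Parameter | Default | Description |
| --- | --- | --- |
| --http-host | 0.0.0.0 | Host the HTTP server binds to. |
| --http-port | 8080 | Port the HTTP server binds to. |

Example: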

lmcache server \
    --l1-size-gb 100 --eviction-policy LRU \
    --http-host 0.0.0.0 --http-port 8080

All examples below assume the server is reachable at http://localhost:8080.

Endpoint Overview#

The routes are grouped by purpose below. The operational surface (health, status, cache and storage control) lives at top-level paths; routes inherited from the shared internal_api_server package keep their original paths for compatibility with the vLLM-embedded API server.

Note

Several handlers report failure in the response body rather than via a non-200 status code (e.g. DELETE /l2 returns 200 with ok=false, and /periodic-threads-health returns 200 with healthy=false). The error-field name is also not uniform: /healthcheck and /clear-cache use reason on failure, while /status, /conf, and /kvcache/check use error. Per-endpoint details below are authoritative.
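Because a failure can hide inside a 200 response, a client may want to normalize the outcome in one place. The following is a minimal Python sketch, not part of LMCache: failure_reason is a hypothetical helper, and the fields it inspects (ok, healthy, reason, error) are taken only from the response shapes documented on this page.

```python
from typing import Any, Optional


def failure_reason(status_code: int, body: Any) -> Optional[str]:
    """Return a failure description, or None if the call looks successful.

    Hypothetical client-side helper; it only knows the response shapes
    documented on this page.
    """
    fields = body if isinstance(body, dict) else {}
    # Failure detail may be carried under either "reason" or "error".
    detail = fields.get("reason") or fields.get("error")
    if status_code != 200:
        return str(detail) if detail else f"HTTP {status_code}"
    # Some handlers return 200 and flag failure in the body instead.
    if fields.get("ok") is False or fields.get("healthy") is False:
        return str(detail) if detail else "reported failure with HTTP 200"
    return None
```

Callers pass the HTTP status code and the decoded JSON body; a None result means neither the status code nor the body signalled a problem.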

Liveness and health

| Method | Path | Purpose |
| --- | --- | --- |
| GET | / | Static liveness ping (does not touch the engine). |
| GET | /healthcheck | K8s liveness/readiness probe; 503 until the engine is initialized. |

Inspection and status

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /status | Detailed engine snapshot (L1, L2, registered contexts, sessions, prefetch jobs) for inspection and debugging. |
| GET | /conf | Dump the merged server configuration objects (mp, storage_manager, observability). |
| GET | /version | Combined version string ("<version>-<commit_id>"). |
| GET | /lmc_version | LMCache package version string. |
| GET | /commit_id | Build commit id. |

Cache control

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /clear-cache | Force-clear all KV data held in L1 (CPU) memory. |
| GET | /kvcache/check | Compute MD5 checksums over the engine KV cache for a set of block IDs (diagnostics / round-trip integrity checks). |

L2 storage management

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /l2/adapters | Enumerate every configured L2 adapter with its type_name and primary flag. |
| DELETE | /l2 | Delete a caller-supplied list of keys from one L2 adapter (default: primary; override with ?adapter=<type_name>). |
| GET | /l2/keys | Paginate keys currently resident in one L2 adapter (optionally filtered by model_name). |

Quota management

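| Method | Path | Purpose |
| --- | --- | --- |
| GET | /quota | List every registered cache_salt quota with its live usage. |
| PUT | /quota/{cache_salt} | Set or update the quota (in GB) for a cache_salt. |
| GET | /quota/{cache_salt} | Read the quota and live usage for a single cache_salt. |
| DELETE | /quota/{cache_salt} | Remove the quota entry for a cache_salt (its data is evicted on the next cycle). |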

Runtime L2 reconfiguration

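| Method | Path | Purpose |
| --- | --- | --- |
| GET | /reconfigure/backends | List backend strings accepted by the reconfiguration routes. |
| GET | /reconfigure/{backend}/status | List the runtime-manageable L2 adapters for the given backend type. |
| POST | /reconfigure/{backend}/{operation} | Apply one runtime reconfiguration operation to a backend adapter. |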

Observability

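| Method | Path | Purpose |
| --- | --- | --- |
| GET | /metrics | Prometheus exposition format. |
| POST | /metrics/reset | Reset all observability metrics to their initial state. |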

Diagnostics and debugging

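| Method | Path | Purpose |
| --- | --- | --- |
| GET | /loglevel | List or inspect logger levels; also accepts level to mutate one. |
| GET | /threads | List active Python threads with their stack traces. |
| GET | /periodic-threads | List registered periodic threads with summary counts. |
| GET | /periodic-threads/{thread_name} | Detailed status of a single periodic thread. |
| GET | /periodic-threads-health | Fast health check over critical- and high-level periodic threads. |
| GET | /env | Dump process environment variables (JSON body, text/plain). |
| POST | /run_script | Execute an uploaded Python script in a restricted sandbox. |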

Liveness and Health#

GET /#

Basic liveness check. Returns a static payload indicating the HTTP server is running; it does not touch the cache engine. Use /healthcheck instead for probes that also verify the engine is initialized.

Response (200 OK):

{
  "status": "ok",
  "service": "LMCache HTTP API"
}

Example:

curl -s http://localhost:8080/

GET /healthcheck#

Health check endpoint suitable for Kubernetes liveness and readiness probes. A 200 response means the HTTP server is alive and the MP cache engine object is wired onto app.state. A 503 response indicates the engine is not yet present (still initializing, or failed to initialize). The check verifies that the engine attribute is set; it does not call into the engine to assert deeper liveness.

Response (200 OK):

{
  "status": "healthy"
}

Response (503 Service Unavailable):

{
  "status": "unhealthy",
  "reason": "engine not initialized"
}

Example:

curl -s http://localhost:8080/healthcheck

Kubernetes probe snippet:

livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
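Outside Kubernetes, a deployment script can apply the same rule (200 means ready, 503 means still initializing) with a small polling loop. This is a hedged sketch using only the Python standard library; wait_until_healthy and http_probe are hypothetical names, and the timeout and interval defaults are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8080/healthcheck") -> int:
    """Issue GET /healthcheck and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on non-2xx; 503 means the engine is not initialized.
        return exc.code


def wait_until_healthy(
    probe: Callable[[], int] = http_probe,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `probe` returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection refused or timed out: the HTTP server is not up yet.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The probe, sleep, and clock arguments are injectable so that the loop can be exercised in tests without a running server.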

Inspection and Status#

GET /status#

Returns a detailed snapshot of the MP engine's internal state. The payload is assembled by MPCacheServer.report_status(): a fixed set of engine-level fields, the full storage-manager status, plus whatever keys each loaded module contributes. The exact key set therefore depends on which modules are active: registered_gpu_ids and cache_context_meta come from the transfer module, active_prefetch_jobs from the lookup module, and blend modes add their own fields. Intended for operators and debugging, not for monitoring (use Prometheus metrics for time-series data; see Observability).

Response (200 OK):

{
  "is_healthy": true,
  "engine_type": "MPCacheServer",
  "chunk_size": 256,
  "hash_algorithm": "builtin-hash",
  "active_sessions": 2,
  "registered_gpu_ids": [0, 1],
  "cache_context_meta": {
    "0": {
      "model_name": "meta-llama/Llama-3.1-8B-Instruct",
      "world_size": 1,
      "kv_cache_layout": {
        "num_layers": 32,
        "num_blocks": 12345,
        "cache_size_per_token": 131072,
        "kernel_groups": [
          {
            "kernel_group_idx": 0,
            "engine_group_idx": 0,
            "object_group_idx": 0,
            "num_layers": 32,
            "layer_indices": [0, 1, "..."],
            "tokens_per_block": 16,
            "slots_per_block": 16,
            "dtype": "torch.bfloat16",
            "engine_kv_concrete_shape": "...",
            "is_mla": false,
            "engine_kv_format": "...",
            "engine_kv_shape": "...",
            "attention_backend": "..."
          }
        ]
      }
    }
  },
  "active_prefetch_jobs": 0,
  "storage_manager": {
    "is_healthy": true,
    "...": "backend-specific fields"
  }
}

Response (503 Service Unavailable) when the engine is not yet initialized:

{
  "error": "engine not initialized"
}

Example:

curl -s http://localhost:8080/status | jq

GET /conf#

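Returns, as a single indented JSON document, every server-side configuration object registered on app.state.configs (typically mp, storage_manager, and observability). Dataclasses are serialized through safe_asdict; other values go through make_json_safe. Useful for confirming what configuration the process actually loaded (including environment-variable overrides) without restarting the server.

Response (200 OK):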

{
  "mp": {
    "http_host": "0.0.0.0",
    "http_port": 8080,
    "...": "..."
  },
  "storage_manager": {
    "...": "..."
  },
  "observability": {
    "...": "..."
  }
}

Response (503 Service Unavailable) when the configs are not yet attached to app.state:

{
  "error": "configs not initialized"
}

Example:

curl -s http://localhost:8080/conf | jq

GET /version, GET /lmc_version, GET /commit_id#

Version descriptors. Each returns a bare JSON string (not an object):

  • GET /version — the combined descriptor from lmcache.utils.get_version(), formatted "<version>-<commit_id>" (e.g. "0.3.1-ca79ea33"). On a source checkout without build-time version metadata, each missing component falls back to the literal "NA" (so a metadata-less build returns "NA-NA").

  • GET /lmc_version — the raw package version string (lmcache.utils.VERSION); empty string "" when the generated lmcache._version module is absent.

  • GET /commit_id — the git commit id baked into the build (lmcache.utils.COMMIT_ID); empty string "" when unavailable.

All three are unconditional 200 OK.

Examples:

curl -s http://localhost:8080/version
curl -s http://localhost:8080/lmc_version
curl -s http://localhost:8080/commit_id
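Since /version returns a bare JSON string, a client first decodes it with a JSON parser and can then split the two components. The sketch below is illustrative only: split_version and has_build_metadata are hypothetical helpers, and splitting on the last hyphen assumes the commit id itself contains none.

```python
import json
from typing import Tuple


def split_version(descriptor: str) -> Tuple[str, str]:
    """Split "<version>-<commit_id>" into (version, commit_id).

    Splits on the last hyphen (assumption: the commit id has no hyphen,
    while a version such as "0.3.1-rc1" might).
    """
    version, sep, commit_id = descriptor.rpartition("-")
    if not sep:
        raise ValueError(f"not a '<version>-<commit_id>' string: {descriptor!r}")
    return version, commit_id


def has_build_metadata(descriptor: str) -> bool:
    """False when either component is the documented "NA" fallback."""
    return "NA" not in split_version(descriptor)


# The HTTP body is a JSON string literal, quotes included.
body = '"0.3.1-ca79ea33"'
version, commit_id = split_version(json.loads(body))
```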

Cache Control#

POST /clear-cache#

Force-clears all KV cache data currently held in L1 (CPU) memory (delegates to the ManagementModule).

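Warning

This endpoint is destructive and bypasses the read/write locks. In-flight store or prefetch operations may be corrupted as a result. Use it only when the server is idle, or to recover from a known-bad cache state.

The request body is ignored.

Response (200 OK):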

{
  "status": "ok"
}

Response (503 Service Unavailable):

{
  "status": "error",
  "reason": "engine not initialized"
}

Example:

curl -s -X POST http://localhost:8080/clear-cache

GET /kvcache/check#

Computes MD5 checksums over the engine KV cache, hashing chunk_size blocks into each chunk. MP mode addresses KV storage by block IDs natively (the same units used by STORE / RETRIEVE), so the endpoint is fully block-centric: block_ids enumerates the target blocks and chunk_size counts blocks per chunk. Intended for diagnostics and round-trip integrity checks from lmcache bench server, not for the inference data path.

Query parameters:

| Name | Required | Description |
| --- | --- | --- |
| block_ids | Yes | Engine block IDs in mixed format, e.g. "0,[2,5],8". |
| chunk_size | Yes | Positive integer: the number of blocks per hashed chunk. |
| instance_id | No (default 0) | Registered KV context ID on the engine. |
| layerwise | No (default false) | If true, return per-layer checksums keyed by "layer_<idx>"; otherwise return a single aggregate digest over all layers for each chunk. |

Response (200 OK):

{
  "status": "success",
  "chunk_size": 2,
  "num_chunks": 2,
  "chunk_checksums": ["<md5>", "<md5>"],
  "layerwise": false,
  "block_id_ranges": "0,[2,5],8"
}

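When layerwise=true, chunk_checksums is a dictionary keyed by "layer_<idx>" whose values are per-layer lists.

HTTP status codes:

  • 200: success.

  • 400: block_ids is missing or malformed, or chunk_size is missing or not positive.

  • 404: instance_id is not registered, or the registered KV tensors are empty.

  • 501: the engine has no cache_contexts, or the KV format is not supported by this endpoint (page-buffer fusion and cross-layer layouts are rejected until they are actually needed).

  • 503: the engine is not yet initialized on app.state.

Examples: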

curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2"

curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2&layerwise=true"
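For a round-trip integrity check, a client can call the endpoint before and after an operation and compare the two chunk_checksums payloads. The following sketch assumes only the response shapes described above (a list in aggregate mode, a dict of per-layer lists when layerwise=true); changed_chunks and the "all_layers" label are hypothetical.

```python
from typing import Any, Dict, List


def changed_chunks(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Return labels of chunks whose checksum differs between two responses."""
    a, b = before["chunk_checksums"], after["chunk_checksums"]
    if isinstance(a, dict) != isinstance(b, dict):
        raise ValueError("both responses must use the same layerwise setting")
    # Normalize the aggregate (list) form to a single pseudo-layer.
    a_layers = a if isinstance(a, dict) else {"all_layers": a}
    b_layers = b if isinstance(b, dict) else {"all_layers": b}
    changed = []
    for layer in sorted(set(a_layers) | set(b_layers)):
        lhs, rhs = a_layers.get(layer, []), b_layers.get(layer, [])
        for idx in range(max(len(lhs), len(rhs))):
            left = lhs[idx] if idx < len(lhs) else None
            right = rhs[idx] if idx < len(rhs) else None
            if left != right:
                changed.append(f"{layer}[{idx}]")
    return changed
```

An empty result means every chunk hashed to the same digest in both snapshots.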

L2 Storage Management#

Three endpoints — GET /l2/adapters, DELETE /l2, and GET /l2/keys — let operators enumerate the configured L2 backends, purge keys from one, and enumerate what is currently resident.

DELETE /l2 and GET /l2/keys accept an optional ?adapter=<type_name> query parameter to target a specific adapter. Omit the selector to target the primary (first-configured) adapter — the v1 behavior, preserved for clients that don't care about multi-adapter deployments. When multiple adapters share a type_name, the first match wins. Use GET /l2/adapters to learn the valid selectors.

All three are intended for operator / admin workflows ("purge this user's keys", "show me what's resident", "garbage-collect orphans after a rename"). They are not on the inference data path.

L1 is intentionally not touched. Keys deleted from L2 may still return from L1 until the L1 eviction controller expires them naturally; callers that need an L1+L2 purge should layer their own L1 invalidation or wait for natural L1 eviction.

The coordinator's eviction loop uses DELETE /l2 automatically (see Multi-Server Coordination, "L2 usage tracking and eviction"); the GET /l2/keys endpoint also powers the coordinator's startup resync. Manual curl usage is reserved for ad-hoc operator actions and debugging.

For full request/response semantics, pagination, error codes, and the event flow back to the coordinator, see the design doc at docs/design/v1/multiprocess/l2_apis.md.

GET /l2/adapters#

Enumerate every L2 adapter the engine has loaded, in configuration order.

Response (200 OK):

{
  "adapters": [
    {"index": 0, "type_name": "S3L2Adapter", "primary": true},
    {"index": 1, "type_name": "FSL2Adapter", "primary": false}
  ]
}

primary is true only on the first entry. An engine that has no L2 backends returns {"adapters": []} (still 200 — the engine is initialized, it just has no L2 storage).

HTTP status codes:

  • 200: success (including the no-adapters case).

  • 503: engine not initialized.

Example:

curl -s http://localhost:8080/l2/adapters | jq

DELETE /l2#

Delete a caller-supplied list of keys from one L2 adapter. Idempotent: keys absent from the adapter are skipped silently; keys currently locked by in-flight store/load tasks are skipped so the delete never corrupts an active transfer. The blocking adapter call is run off the event loop.

Query parameters:

| Name | Default | Description |
| --- | --- | --- |
| adapter | primary | type_name of the target adapter (see GET /l2/adapters). Omit to target the primary (first-configured) adapter. First match wins when multiple adapters share a type_name. |

Per-key successful deletions fire on_l2_keys_deleted on the adapter's listeners — when the coordinator is wired (see --coordinator-l2-event-reporting), the deletions show up at the coordinator's POST /l2/events as "type": "delete" events. The coordinator's eviction + usage trackers learn about the deletion from that event flow, not from the response of this call.

Body: {"keys": [EncodedObjectKey, ...]} where each EncodedObjectKey is

{
  "chunk_hash_hex": "abc123...",
  "model_name": "meta-llama/Llama-3-8B",
  "kv_rank": 0,
  "object_group_id": 0,
  "cache_salt": "user-a"
}

object_group_id (default 0) and cache_salt (default "") are optional for backward compatibility with older wire payloads. The batch is capped at 10000 keys per request.

Response (200 OK):

{
  "requested": 2,
  "adapter": "S3L2Adapter",
  "ok": true
}

On adapter-level failure the response is still 200 with ok=false and an error field carrying the reason.

HTTP status codes:

  • 200: request reached the adapter (check ok for outcome).

  • 400: batch exceeds the limit, or a key payload violates an ObjectKey invariant (bad hex, @ in model_name, forbidden cache_salt character).

  • 404: ?adapter=<name> does not match any configured adapter.

  • 422: Pydantic-level body-shape failure (missing keys, wrong field types).

  • 503: engine not initialized, or no L2 adapters configured.

Example:

curl -s -X DELETE http://localhost:8080/l2 \
    -H 'Content-Type: application/json' \
    -d '{
        "keys": [
          {"chunk_hash_hex": "aa", "model_name": "m",
           "kv_rank": 0, "object_group_id": 0, "cache_salt": "user-a"}
        ]
    }'
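When purging more than 10000 keys, the caller has to split the work across several requests. The sketch below builds request bodies in the documented shape; make_key and delete_bodies are hypothetical client-side helpers, and sending each body with DELETE /l2 is left to whichever HTTP client is in use.

```python
from typing import Any, Dict, Iterator, List

MAX_KEYS_PER_REQUEST = 10000  # documented per-request cap for DELETE /l2


def make_key(chunk_hash_hex: str, model_name: str, kv_rank: int,
             object_group_id: int = 0, cache_salt: str = "") -> Dict[str, Any]:
    """Build one EncodedObjectKey payload using the documented field names."""
    return {
        "chunk_hash_hex": chunk_hash_hex,
        "model_name": model_name,
        "kv_rank": kv_rank,
        "object_group_id": object_group_id,
        "cache_salt": cache_salt,
    }


def delete_bodies(keys: List[Dict[str, Any]],
                  limit: int = MAX_KEYS_PER_REQUEST) -> Iterator[Dict[str, Any]]:
    """Yield {"keys": [...]} bodies, each holding at most `limit` keys."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(keys), limit):
        yield {"keys": keys[start:start + limit]}
```

Because the endpoint is idempotent, a batch that fails in transit can be retried without tracking which keys were already removed.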

GET /l2/keys#

Paginate keys currently resident in one L2 adapter.

Query parameters:

| Name | Default | Description |
| --- | --- | --- |
| adapter | primary | type_name of the target adapter (see GET /l2/adapters). Omit to target the primary (first-configured) adapter. First match wins when multiple adapters share a type_name. |
| model_name | none | Restrict the result to keys whose model_name matches. |
| page_size | 500 | Max entries per page. Must be in [1, 5000]; an out-of-range value is rejected with 422 (it is not silently clamped). |
| page_token | none | Opaque cursor from the previous page's next_page_token. Omit on the first call; pass back verbatim on subsequent calls. |

The page token is private to the adapter; do not parse or modify it. Adapters that support listing (currently only the S3 adapter via ListObjectsV2) guarantee best-effort consistency, not snapshot isolation — concurrent stores or deletes during a paginated walk may cause keys to appear, disappear, or shift between pages.

Response (200 OK):

{
  "adapter": "S3L2Adapter",
  "entries": [
    {
      "key": {
        "chunk_hash_hex": "abc123",
        "model_name": "meta-llama/Llama-3-8B",
        "kv_rank": 0,
        "object_group_id": 0,
        "cache_salt": "user-a"
      },
      "size_bytes": 4194304
    }
  ],
  "next_page_token": "opaque-cursor-string"
}

next_page_token is null when the listing is exhausted.

HTTP status codes:

  • 200: success.

  • 400: malformed page_token (adapter-level).

  • 404: ?adapter=<name> does not match any configured adapter.

  • 422: page_size outside [1, 5000].

  • 501: selected adapter does not implement listing. In v1 only S3L2Adapter does; adapters wrapped by SerdeL2AdapterWrapper inherit the wrapped adapter's behavior.

  • 503: engine not initialized, or no L2 adapters configured.

Example: paginate every key for a model.

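url="http://localhost:8080/l2/keys"
next=""
while :; do
  # Omit page_token on the first call; afterwards pass it back verbatim (URL-encoded).
  page=$(curl -s -G "$url" \
      --data-urlencode "model_name=meta-llama/Llama-3-8B" \
      --data-urlencode "page_size=500" \
      ${next:+--data-urlencode "page_token=$next"})
  echo "$page" | jq '.entries[]'
  next=$(echo "$page" | jq -r '.next_page_token // empty')
  [ -z "$next" ] && break
done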
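The same pagination walk can be written in Python. This is a minimal sketch under the contract described above (omit page_token on the first call, pass it back verbatim afterwards, stop when next_page_token is null); iter_l2_entries and http_fetcher are hypothetical names, and error handling for the non-200 cases is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional

FetchPage = Callable[[Optional[str]], Dict[str, Any]]


def iter_l2_entries(fetch_page: FetchPage) -> Iterator[Dict[str, Any]]:
    """Yield every entry across pages until next_page_token is null."""
    token: Optional[str] = None  # no token on the first call
    while True:
        page = fetch_page(token)
        yield from page.get("entries", [])
        token = page.get("next_page_token")
        if not token:
            return


def http_fetcher(base_url: str, model_name: Optional[str] = None,
                 page_size: int = 500) -> FetchPage:
    """Build a fetch_page function that calls GET /l2/keys with urllib."""
    def fetch(token: Optional[str]) -> Dict[str, Any]:
        params: Dict[str, Any] = {"page_size": page_size}
        if model_name is not None:
            params["model_name"] = model_name
        if token:
            params["page_token"] = token  # opaque; only URL-encoded, never parsed
        url = f"{base_url}/l2/keys?{urllib.parse.urlencode(params)}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    return fetch
```

For example, `for entry in iter_l2_entries(http_fetcher("http://localhost:8080", "meta-llama/Llama-3-8B")): ...` visits each resident key once, subject to the best-effort consistency noted above.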

Quota Management#

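These endpoints manage the per-cache_salt storage budgets consumed by the IsolatedLRU eviction policy (selected with --eviction-policy IsolatedLRU). Quotas are soft: setting a limit does not reject writes; any cache_salt that exceeds its budget is evicted on the next eviction cycle (about 1 second). A cache_salt with no registered quota has an effective limit of 0 bytes, so its data is cleared on the next cycle (allowlist semantics).

On an engine that was not started with --eviction-policy IsolatedLRU these endpoints are no-ops: the QuotaManager still exists, but the LRU policy ignores registered quotas.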

URL escaping for the empty salt. cache_salt="" (un-salted / anonymous traffic) cannot appear in a URL path parameter, so the API accepts the sentinel _default in its place. PUT /quota/_default sets the quota for cache_salt="", and _default is echoed back in responses for the empty salt. A user that legitimately stores data with cache_salt="_default" cannot be managed via this HTTP API distinctly from anonymous traffic — both map to the same path parameter; pick any other value (e.g. "default") to disambiguate.
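A client library can hide the sentinel behind two small helpers. This is a sketch only: quota_path and salt_from_response are hypothetical, rejecting a literal "_default" salt is one possible client-side policy for the collision described above, and percent-encoding the salt is an assumption about how callers want to handle reserved characters.

```python
from urllib.parse import quote

DEFAULT_SALT_SENTINEL = "_default"  # documented stand-in for cache_salt=""


def quota_path(cache_salt: str) -> str:
    """Return the /quota/{cache_salt} path, mapping "" to the sentinel."""
    if cache_salt == DEFAULT_SALT_SENTINEL:
        # On this API the literal salt cannot be told apart from the empty salt.
        raise ValueError('cache_salt "_default" collides with the empty-salt sentinel')
    segment = cache_salt or DEFAULT_SALT_SENTINEL
    # Percent-encode everything, including "/", so the salt stays one segment.
    return "/quota/" + quote(segment, safe="")


def salt_from_response(name: str) -> str:
    """Map the sentinel echoed in responses back to the empty salt."""
    return "" if name == DEFAULT_SALT_SENTINEL else name
```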

PUT /quota/{cache_salt}#

Create or update a quota.

Body: {"limit_gb": <float>} (required, finite, non-negative).

Response (200 OK):

{"cache_salt": "alice", "limit_gb": 10.0, "status": "ok"}

Errors: 400 for malformed JSON, a missing limit_gb, a non-numeric limit_gb, nan / inf, or a negative value; 503 if the engine is not initialized.

Example:

curl -s -X PUT http://localhost:8080/quota/alice \
    -H 'Content-Type: application/json' \
    -d '{"limit_gb": 10.0}'

GET /quota/{cache_salt}#

Read the current quota and live usage for one cache_salt.

Response (200 OK):

{
  "cache_salt": "alice",
  "limit_gb": 10.0,
  "current_usage_gb": 2.137,
  "exists": true
}

exists is false when no quota was ever registered for this cache_salt (limit_gb is then 0.0 and current_usage_gb reflects whatever bytes are currently cached for that salt — those bytes will evict next cycle under IsolatedLRU). This endpoint never returns 404 for an unknown salt.

DELETE /quota/{cache_salt}#

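Remove the quota entry for a cache_salt. Any bytes still cached under this cache_salt become over budget on the next eviction cycle (the effective limit drops to 0) and are evicted.

Response (200 OK):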

{"cache_salt": "alice", "status": "removed"}

When no quota is registered for the given cache_salt, the response is {"cache_salt": "...", "status": "not_found"} (still 200 OK).

GET /quota#

List every registered quota with its live usage.

Response (200 OK):

{
  "users": {
    "alice": {"limit_gb": 10.0, "current_usage_gb": 2.137},
    "bob":   {"limit_gb":  4.0, "current_usage_gb": 0.812}
  }
}

Only cache_salt values with a registered quota appear; the empty salt is reported under the _default key.
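Because quotas are soft, a salt can briefly sit above its limit until the next eviction cycle. The sketch below lists such salts from a GET /quota response; over_budget is a hypothetical helper that relies only on the users, limit_gb, and current_usage_gb fields shown above.

```python
from typing import Any, Dict, List, Tuple


def over_budget(quota_listing: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Return (cache_salt, excess_gb) pairs, largest excess first."""
    result = []
    for salt, info in quota_listing.get("users", {}).items():
        excess = info["current_usage_gb"] - info["limit_gb"]
        if excess > 0:
            result.append((salt, excess))
    return sorted(result, key=lambda pair: pair[1], reverse=True)
```

Salts without a registered quota never appear in the listing, so this view does not cover data that is about to be cleared under the zero-byte default.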

Runtime L2 Reconfiguration#

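These endpoints are available when the server has L2 adapters that can be reconfigured at runtime. They change only LMCache's runtime mappings and metadata; backend resources such as DAX device paths must already exist and must be readable and writable by the server. The endpoint routes backend, operation, and the JSON request body to the generic L2 adapter reconfiguration API, while backend-specific validation and migration semantics stay inside the adapter.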

backend and operation path segments are normalized (stripped and lower-cased). Within a request body, adapter_index (default 0) is backend-local — it indexes only the adapters of that backend, not the engine-wide adapter list. If an L2 adapter is wrapped by serde, the backend string is still the configured L2 adapter type, not the serde wrapper type.

GET /reconfigure/backends#

List the backend strings that can be used in /reconfigure/{backend}/status and /reconfigure/{backend}/{operation}.

Response (200 OK):

{
  "enabled": true,
  "num_backends": 1,
  "backends": ["dax"]
}

enabled is false (and backends empty) when no reconfigurable adapter is present.

HTTP status codes: 200 on success; 503 if the engine is not initialized.

GET /reconfigure/{backend}/status#

Report the runtime-manageable adapters for one backend type. Each adapter entry's adapter_index is rewritten to its backend-local 0-based index (the value to pass back in operation request bodies).

Response (200 OK):

{
  "enabled": true,
  "backend": "dax",
  "num_adapters": 1,
  "adapters": [
    {"adapter_index": 0, "...": "backend-specific adapter fields"}
  ]
}

An unknown or empty backend returns enabled=false, num_adapters=0, adapters=[] (it is not a 404).

HTTP status codes: 200 on success; 400 if backend is empty; 503 if the engine is not initialized.

POST /reconfigure/{backend}/{operation}#

Apply one reconfiguration operation to a backend adapter. The request body is a JSON object whose accepted fields depend on the backend and operation. The 200 response is whatever the storage manager's reconfigure_l2_adapter returns (a backend-defined dict).

For the generic path (any backend other than dax), the body carries adapter_index plus any backend-specific fields, which are forwarded verbatim to the adapter.

For Device-DAX (backend=dax), JSON request bodies are used because DAX paths contain slashes. The accepted operations and fields are:

| Operation | Body fields |
| --- | --- |
| add | device_path (str, required), size (int byte count or string such as "100GiB", required), adapter_index (default 0). |
| remove | device_path (str, required), mode (migrate, evict, or drain; default migrate), force (bool, default false), adapter_index (default 0). |
| resize | device_path (str, required), size (int or string, required), mode (migrate or evict; default migrate), force (bool, default false), adapter_index (default 0). |

size accepts an integer byte count or a string with a base-1024 unit suffix (b, kib, mib, gib, tib and the k/m/g/t aliases), e.g. "100GiB"; it must resolve to a positive value.
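A client can pre-validate size before sending a request. The following is an illustrative client-side sketch, not the server's parser: parse_size is hypothetical, and it assumes the suffix is case-insensitive, the numeric part is a whole number, and a bare numeric string means bytes.

```python
import re
from typing import Union

# Base-1024 multipliers for the documented suffixes and their short aliases.
_UNITS = {
    "b": 1,
    "k": 1024, "kib": 1024,
    "m": 1024 ** 2, "mib": 1024 ** 2,
    "g": 1024 ** 3, "gib": 1024 ** 3,
    "t": 1024 ** 4, "tib": 1024 ** 4,
}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([a-z]*)\s*$", re.IGNORECASE)


def parse_size(size: Union[int, str]) -> int:
    """Resolve an int byte count or a string such as "100GiB" to bytes."""
    if isinstance(size, bool):
        raise ValueError("size must be an int or a string")
    if isinstance(size, int):
        value = size
    else:
        match = _SIZE_RE.match(size)
        if match is None:
            raise ValueError(f"unparseable size: {size!r}")
        number, unit = match.groups()
        unit = unit.lower() or "b"  # assumption: no suffix means bytes
        if unit not in _UNITS:
            raise ValueError(f"unknown unit in size: {size!r}")
        value = int(number) * _UNITS[unit]
    if value <= 0:
        raise ValueError("size must resolve to a positive value")
    return value
```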

HTTP status codes:

  • 200: success (body is the storage manager's reconfigure result).

  • 400: empty backend/operation, an unsupported DAX operation, or an invalid size.

  • 404: adapter_index is out of range for the backend.

  • 422: request body fails validation (e.g. a missing required field, or an unknown field in a DAX body — DAX bodies reject extras).

  • 503: engine not initialized.

For detailed request examples, mode semantics, and validation guidance, see Device-DAX (/dev/dax).

Observability#

GET /metrics#

Prometheus exposition format for every metric registered on the default prometheus_client registry (Content-Type: text/plain). Scrape this directly from Prometheus. See Observability for the list of exported metrics.

Example:

curl -s http://localhost:8080/metrics

POST /metrics/reset#

Resets all LMCache observability metrics to their initial state (reset_observability_metrics). Intended for test harnesses and benchmarks, not for production use.

Response (200 OK, text/plain):

ok

Example:

curl -s -X POST http://localhost:8080/metrics/reset

Diagnostics and Debugging#

GET /loglevel#

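Inspect or change Python logger levels at runtime. All responses are text/plain. The endpoint has three modes, driven by the query parameters:

| Query | Behavior |
| --- | --- |
| (no parameters) | List every logger registered with logging, along with its level. |
| ?logger_name=<name> | Return the effective level of the named logger. |
| ?logger_name=<name>&level=<LEVEL> | Set the named logger (and its handlers) to LEVEL (DEBUG/INFO/WARNING/ERROR/CRITICAL; case-insensitive). Returns 400 on an unknown level. |

Passing level without logger_name matches none of the modes and returns 200 with a null body.

Examples: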

# list everything
curl -s http://localhost:8080/loglevel

# read one
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache'

# elevate to DEBUG
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache&level=DEBUG'

GET /threads#

Enumerate active Python threads in the server process along with their stack traces, plus a total-count summary (Content-Type: text/plain). Useful for live debugging of hangs or runaway workers.

| Query | Behavior |
| --- | --- |
| ?name=<substr> | Keep only threads whose name contains <substr> (case-insensitive). |
| ?thread_id=<int> | Keep only the thread whose ident matches. |

Warning

The response contains live stack traces and can disclose internal code paths and state. Restrict network access to this endpoint in production.

Example:

curl -s 'http://localhost:8080/threads?name=periodic'

GET /periodic-threads#

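Returns a JSON snapshot of the PeriodicThreadRegistry: per-level counts plus the status of each thread (time since the last run, latest summary, and so on).

| Query | Behavior |
| --- | --- |
| ?level=<level> | Only include threads at the given level (critical, high, medium, or low). Returns 400 for an unknown level. |
| ?running_only=true | Only include threads that are currently running. |
| ?active_only=true | Only include threads considered active (recent tick). |

Response (200 OK):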

{
  "summary": {
    "total_count": 4,
    "running_count": 4,
    "active_count": 4,
    "by_level": {
      "critical": {"total": 1, "running": 1, "active": 1},
      "high":     {"total": 2, "running": 2, "active": 2},
      "medium":   {"total": 1, "running": 1, "active": 1},
      "low":      {"total": 0, "running": 0, "active": 0}
    }
  },
  "threads": [
    {
      "name": "...",
      "level": "high",
      "interval": 5.0,
      "is_running": true,
      "is_active": true,
      "last_run_ago": 1.2,
      "total_runs": 120,
      "failed_runs": 0,
      "success_rate": 100.0,
      "last_summary": {"...": "..."}
    }
  ]
}

Example:

curl -s 'http://localhost:8080/periodic-threads?level=critical' | jq

GET /periodic-threads/{thread_name}#

Detailed status for a single periodic thread (the same per-thread object shown in the threads list above). Returns 404 with {"error": "Thread not found: <name>"} if the name is unknown.

Example:

curl -s http://localhost:8080/periodic-threads/storage-flush | jq

GET /periodic-threads-health#

Fast health check covering only critical and high level periodic threads. A thread is flagged unhealthy when it is marked running but has not ticked within its expected interval. Always returns 200 — health is conveyed by the healthy boolean, not the HTTP status.

Response (200 OK):

{
  "healthy": true,
  "unhealthy_count": 0,
  "unhealthy_threads": []
}

When a thread is lagging:

{
  "healthy": false,
  "unhealthy_count": 1,
  "unhealthy_threads": [
    {
      "name": "storage-flush",
      "level": "critical",
      "last_run_ago": 42.5,
      "interval": 5.0
    }
  ]
}

Example:

curl -s http://localhost:8080/periodic-threads-health
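Since this endpoint always answers 200, a probe script has to read the healthy flag itself. The sketch below turns the documented body into an exit code and a short report; probe_exit_code and describe_lagging are hypothetical helpers that use only the fields shown in the responses above.

```python
from typing import Any, Dict, List


def probe_exit_code(health: Dict[str, Any]) -> int:
    """Return 0 only when the body explicitly reports healthy=true."""
    return 0 if health.get("healthy") is True else 1


def describe_lagging(health: Dict[str, Any]) -> List[str]:
    """Format one line per unhealthy thread, showing how far it has slipped."""
    lines = []
    for thread in health.get("unhealthy_threads", []):
        ago, interval = thread["last_run_ago"], thread["interval"]
        line = f'{thread["name"]} ({thread["level"]}): last ran {ago:.1f}s ago'
        if interval > 0:
            line += f", about {ago / interval:.1f}x its {interval:.1f}s interval"
        lines.append(line)
    return lines
```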

GET /env#

Dumps the process environment variables as a sorted, pretty-printed JSON document. The response Content-Type is text/plain so it can be piped directly to a terminal.

Warning

The payload contains every environment variable, including any secrets injected via the environment. There is no redaction or auth — restrict network access to this endpoint in production.

Example:

curl -s http://localhost:8080/env

POST /run_script#

Execute an uploaded Python script inside the server process. The script is uploaded as multipart form data under the field name script and is exec'd with a restricted __builtins__ (only print, str, int, float, list, dict, tuple, set, and a guarded __import__). Only modules listed in --script-allowed-imports can be imported; the running FastAPI app is injected into the script globals. If the script assigns a variable named result, its stringified value is returned; otherwise the body is "Script executed successfully" (Content-Type: text/plain).

Danger

This endpoint runs caller-supplied code in-process. The restricted builtins are not a security sandbox — combined with the injected app object and any allowed imports, treat it as full remote code execution. Never expose it on an untrusted network.

HTTP status codes:

  • 200: script executed.

  • 400: no script file provided.

  • 500: an exception was raised during import setup or execution (body: "Error executing script: <reason>").

Example:

curl -s -X POST http://localhost:8080/run_script \
    -F 'script=@my_script.py'

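Adding New Endpoints#

Endpoints are auto-discovered from lmcache/v1/multiprocess/http_apis/. To add a new MP-only endpoint:

  1. Create a new module named <name>_api.py in that directory.

  2. Define a module-level router = APIRouter().

  3. Register handlers on router with the FastAPI decorators.

  4. Access the engine through request.app.state.engine and handle the None case (engine not yet initialized).

HTTPAPIRegistry loads the module automatically at startup; there is no central registration list to edit.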

If the route is generic enough to be shared with the vLLM-embedded API server, add it under lmcache/v1/internal_api_server/common/ instead. It will be picked up on the MP side via common_api.py unless its module name is listed in _MP_INCOMPATIBLE_MODULES there (reserved for modules that require vLLM-specific app.state attributes; the list is currently empty). A handler that lives under internal_api_server/vllm/ can still be surfaced on the MP server by adding a thin re-export shim under http_apis/ (as version_api.py does for the version endpoints).

When you add a new endpoint, add a matching section to this page that covers its purpose, the request/response structure, and an example curl call.