HTTP API#
When the MP server is started via lmcache server (the recommended entry
point), a FastAPI-based HTTP frontend is exposed alongside the ZMQ socket
used by vLLM. This HTTP API is intended for operators, orchestrators
(e.g. Kubernetes), and debugging tools — it is not on the inference
data path.
Where the routes come from#
Routes are assembled from three sources, all merged into one FastAPI app by
HTTPAPIRegistry at startup:
MP-native routes — any module named
*_api.pyunderlmcache/v1/multiprocess/http_apis/that exposes a module-levelrouter(afastapi.APIRouter) is auto-discovered. This covers the operational surface: status, cache control, L2 management, quota, and runtime reconfiguration.Shared “common” routes —
lmcache/v1/multiprocess/http_apis/common_api.pyaggregates every compatible router underlmcache/v1/internal_api_server/common/(skipping any module listed in_MP_INCOMPATIBLE_MODULES, currently empty) and forwards them to the auto-discovery pipeline. These are the cross-server diagnostics shared with the vLLM-embedded API server (/env,/loglevel,/metrics,/threads,/periodic-threads*,/run_script). Adding a new compatible module underinternal_api_server/commonrequires no wiring changes on the MP side.Re-exported version routes —
lmcache/v1/multiprocess/http_apis/version_api.pyre-exports the router fromlmcache/v1/internal_api_server/vllm/version_api.py, exposing/version,/lmc_version, and/commit_id.
Server Configuration#
Argument |
Default |
Description |
|---|---|---|
|
|
Host to bind the HTTP server. |
|
|
Port to bind the HTTP server. |
Example:
lmcache server \
--l1-size-gb 100 --eviction-policy LRU \
--http-host 0.0.0.0 --http-port 8080
All examples below assume the server is reachable at
http://localhost:8080.
Endpoint Overview#
The routes are grouped by purpose below. The operational surface (health,
status, cache and storage control) lives at top-level paths; routes inherited
from the shared internal_api_server package keep their original paths for
compatibility with the vLLM-embedded API server.
Note
Several handlers report failure in the response body rather than via a
non-200 status code (e.g. DELETE /l2 returns 200 with ok=false,
and /periodic-threads-health returns 200 with healthy=false).
The error-field name is also not uniform: /healthcheck and
/clear-cache use reason on failure, while /status, /conf,
and /kvcache/check use error. Per-endpoint details below are
authoritative.
Liveness and health
Method |
Path |
Purpose |
|---|---|---|
GET |
|
Static liveness ping (does not touch the engine). |
GET |
|
K8s liveness/readiness probe; |
Inspection and status
Method |
Path |
Purpose |
|---|---|---|
GET |
|
Detailed engine snapshot (L1, L2, registered contexts, sessions, prefetch jobs) for inspection and debugging. |
GET |
|
Dump the merged server configuration objects ( |
GET |
|
Combined version string ( |
GET |
|
LMCache package version string. |
GET |
|
Build commit id. |
Cache control
Method |
Path |
Purpose |
|---|---|---|
POST |
|
Force-clear all KV data in L1 (CPU) memory. |
GET |
|
Compute MD5 checksums over the engine KV cache for a set of block IDs (diagnostics / round-trip integrity checks). |
L2 storage management
Method |
Path |
Purpose |
|---|---|---|
GET |
|
Enumerate every configured L2 adapter with its |
DELETE |
|
Delete a caller-supplied list of keys from one L2 adapter (default:
primary; override with |
GET |
|
Paginate keys currently resident in one L2 adapter (optionally
filtered by |
POST |
|
Warm a node’s L1 by loading a token sequence’s chunks from L2 ahead
of traffic; returns a |
GET |
|
Poll a submitted warm prefetch ( |
Quota management
Method |
Path |
Purpose |
|---|---|---|
GET |
|
List every registered |
PUT |
|
Set or update the quota (in GB) for a |
GET |
|
Read the quota and live usage for a single |
DELETE |
|
Remove a |
Runtime L2 reconfiguration
Method |
Path |
Purpose |
|---|---|---|
GET |
|
List backend strings accepted by the reconfiguration routes. |
GET |
|
Report runtime-manageable L2 adapters for one backend type. |
POST |
|
Apply one runtime reconfiguration operation to a backend adapter. |
Observability
Method |
Path |
Purpose |
|---|---|---|
GET |
|
Prometheus exposition format. |
POST |
|
Reset all observability metrics to their initial state. |
Diagnostics and debugging
Method |
Path |
Purpose |
|---|---|---|
GET |
|
List or inspect logger levels; also accepts |
GET |
|
Enumerate active Python threads and their stack traces. |
GET |
|
List registered periodic threads with summary counts. |
GET |
|
Detailed status for a single periodic thread. |
GET |
|
Quick health check for critical/high-level periodic threads. |
GET |
|
Dump process environment variables (JSON body, |
POST |
|
Execute an uploaded Python script in a restricted sandbox. |
Liveness and Health#
GET /#
Basic liveness check. Returns a static payload indicating the HTTP server
is running; it does not touch the cache engine. Use /healthcheck
instead for probes that also verify the engine is initialized.
Response (200 OK):
{
"status": "ok",
"service": "LMCache HTTP API"
}
Example:
curl -s http://localhost:8080/
GET /healthcheck#
Health check endpoint suitable for Kubernetes liveness and readiness
probes. A 200 response means the HTTP server is alive and the MP
cache engine object is wired onto app.state. A 503 response
indicates the engine is not yet present (still initializing, or failed to
initialize). The check verifies that the engine attribute is set; it does
not call into the engine to assert deeper liveness.
Response (200 OK):
{
"status": "healthy"
}
Response (503 Service Unavailable):
{
"status": "unhealthy",
"reason": "engine not initialized"
}
Example:
curl -s http://localhost:8080/healthcheck
Kubernetes probe snippet:
livenessProbe:
httpGet:
path: /healthcheck
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /healthcheck
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
Inspection and Status#
GET /status#
Returns a detailed snapshot of the MP engine’s internal state. The payload is
assembled by MPCacheServer.report_status(): a fixed set of engine-level
fields, the full storage-manager status, plus whatever keys each loaded module
contributes (so the exact key set depends on which modules are active —
registered_gpu_ids / cache_context_meta come from the transfer module,
active_prefetch_jobs from the lookup module, and blend modes add their own
fields). Intended for operators and debugging, not for monitoring (use
Prometheus metrics for time-series data — see Observability).
Response (200 OK):
{
"is_healthy": true,
"engine_type": "MPCacheServer",
"chunk_size": 256,
"hash_algorithm": "builtin-hash",
"active_sessions": 2,
"registered_gpu_ids": [0, 1],
"cache_context_meta": {
"0": {
"model_name": "meta-llama/Llama-3.1-8B-Instruct",
"world_size": 1,
"kv_cache_layout": {
"num_layers": 32,
"num_blocks": 12345,
"cache_size_per_token": 131072,
"kernel_groups": [
{
"kernel_group_idx": 0,
"engine_group_idx": 0,
"object_group_idx": 0,
"num_layers": 32,
"layer_indices": [0, 1, "..."],
"tokens_per_block": 16,
"slots_per_block": 16,
"dtype": "torch.bfloat16",
"engine_kv_concrete_shape": "...",
"is_mla": false,
"engine_kv_format": "...",
"engine_kv_shape": "...",
"attention_backend": "..."
}
]
}
}
},
"active_prefetch_jobs": 0,
"storage_manager": {
"is_healthy": true,
"...": "backend-specific fields"
}
}
Response (503 Service Unavailable) when the engine has not yet
been initialized:
{
"error": "engine not initialized"
}
Example:
curl -s http://localhost:8080/status | jq
GET /conf#
Returns every server-side configuration object registered on
app.state.configs (typically mp, storage_manager and
observability) as a single indented JSON document. Dataclasses are
serialized via safe_asdict; other values go through make_json_safe.
Useful for confirming what the process actually loaded — including
environment overrides — without restarting.
Response (200 OK):
{
"mp": {
"http_host": "0.0.0.0",
"http_port": 8080,
"...": "..."
},
"storage_manager": {
"...": "..."
},
"observability": {
"...": "..."
}
}
Response (503 Service Unavailable) when configs are not wired
onto app.state yet:
{
"error": "configs not initialized"
}
Example:
curl -s http://localhost:8080/conf | jq
GET /version, GET /lmc_version, GET /commit_id#
Version descriptors. Each returns a bare JSON string (not an object):
GET /version— the combined descriptor fromlmcache.utils.get_version(), formatted"<version>-<commit_id>"(e.g."0.3.1-ca79ea33"). On a source checkout without build-time version metadata, each missing component falls back to the literal"NA"(so a metadata-less build returns"NA-NA").GET /lmc_version— the raw package version string (lmcache.utils.VERSION); empty string""when the generatedlmcache._versionmodule is absent.GET /commit_id— the git commit id baked into the build (lmcache.utils.COMMIT_ID); empty string""when unavailable.
All three are unconditional 200 OK.
Examples:
curl -s http://localhost:8080/version
curl -s http://localhost:8080/lmc_version
curl -s http://localhost:8080/commit_id
Cache Control#
POST /clear-cache#
Force-clears all KV cache data currently held in L1 (CPU) memory
(delegates to the ManagementModule).
Warning
This endpoint is destructive and bypasses read/write locks. In-flight store or prefetch operations may be corrupted. Use only when the server is idle, or when recovering from a known-bad cache state.
The request body is ignored.
Response (200 OK):
{
"status": "ok"
}
Response (503 Service Unavailable):
{
"status": "error",
"reason": "engine not initialized"
}
Example:
curl -s -X POST http://localhost:8080/clear-cache
GET /kvcache/check#
Compute MD5 checksums over the engine KV cache, grouped chunk_size blocks
per hashed chunk. MP mode addresses KV storage by block IDs natively (the
same units used by STORE / RETRIEVE), so the endpoint is fully
block-centric: block_ids enumerates the target blocks and
chunk_size counts blocks per chunk. Intended for diagnostics and
round-trip integrity checks from lmcache bench server — not for the
inference data path.
Query parameters:
Name |
Required |
Description |
|---|---|---|
|
yes |
Engine block IDs in mixed format, e.g. |
|
yes |
Positive integer — number of blocks per hashed chunk. |
|
no (default |
Registered KV context ID on the engine. |
|
no (default |
If |
Response (200 OK):
{
"status": "success",
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": ["<md5>", "<md5>"],
"layerwise": false,
"block_id_ranges": "0,[2,5],8"
}
When layerwise=true, chunk_checksums is a dict keyed by
"layer_<idx>" whose values are per-layer lists.
HTTP status codes:
200: success.400:block_idsmissing/malformed, orchunk_sizemissing or non-positive.404:instance_idnot registered, or the registered KV tensors are empty.501: engine has nocache_contexts, or the KV format is not supported by this endpoint (page-buffer-fused and cross-layer layouts are declined until a real need appears).503: engine not yet initialized onapp.state.
Example:
curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2"
curl -s "http://localhost:8080/kvcache/check?block_ids=0,1,2,3&chunk_size=2&layerwise=true"
L2 Storage Management#
Three endpoints — GET /l2/adapters, DELETE /l2, and
GET /l2/keys — let operators enumerate the configured L2
backends, purge keys from one, and enumerate what is currently
resident.
A further pair — POST /l2/prefetch and
GET /l2/prefetch/{request_id} — lets an operator (or the
coordinator) pre-warm a node’s L1 from L2 ahead of traffic and poll
the load to completion; they are documented at the end of this section.
The coordinator exposes an instance_id-routed variant of both — see
Multi-Server Coordination.
DELETE /l2 and GET /l2/keys accept an optional
?adapter=<type_name> query parameter to target a specific adapter.
Omit the selector to target the primary (first-configured)
adapter — the v1 behavior, preserved for clients that don’t care
about multi-adapter deployments. When multiple adapters share a
type_name, the first match wins. Use GET /l2/adapters to learn
the valid selectors.
All three are intended for operator / admin workflows (“purge this user’s keys”, “show me what’s resident”, “garbage-collect orphans after a rename”). They are not on the inference data path.
L1 is intentionally not touched. Keys deleted from L2 may still return from L1 until the L1 eviction controller expires them naturally; callers that need an L1+L2 purge should layer their own L1 invalidation or wait for natural L1 eviction.
The coordinator’s eviction loop uses DELETE /l2 automatically (see
Multi-Server Coordination — “L2 usage tracking and eviction”); the
GET /l2/keys endpoint also powers the coordinator’s startup
resync. Manual curl usage is reserved for ad-hoc operator
actions and debugging.
For full request/response semantics, pagination, error codes, and the
event flow back to the coordinator, see the design doc at
docs/design/v1/multiprocess/l2_apis.md.
GET /l2/adapters#
Enumerate every L2 adapter the engine has loaded, in configuration order.
Response (200 OK):
{
"adapters": [
{"index": 0, "type_name": "S3L2Adapter", "primary": true},
{"index": 1, "type_name": "FSL2Adapter", "primary": false}
]
}
primary is true only on the first entry. An engine that has
no L2 backends returns {"adapters": []} (still 200 — the
engine is initialized, it just has no L2 storage).
HTTP status codes:
200: success (including the no-adapters case).503: engine not initialized.
Example:
curl -s http://localhost:8080/l2/adapters | jq
DELETE /l2#
Delete a caller-supplied list of keys from one L2 adapter. Idempotent: keys absent from the adapter are skipped silently; keys currently locked by in-flight store/load tasks are skipped so the delete never corrupts an active transfer. The blocking adapter call is run off the event loop.
Query parameters:
Name |
Default |
Description |
|---|---|---|
|
primary |
|
Per-key successful deletions fire on_l2_keys_deleted on the
adapter’s listeners — when the coordinator is wired (see
--coordinator-l2-event-reporting), the deletions show up at the
coordinator’s POST /l2/events as "type": "delete" events. The
coordinator’s eviction + usage trackers learn about the deletion from
that event flow, not from the response of this call.
Body: {"keys": [EncodedObjectKey, ...]} where each
EncodedObjectKey is
{
"chunk_hash_hex": "abc123...",
"model_name": "meta-llama/Llama-3-8B",
"kv_rank": 0,
"object_group_id": 0,
"cache_salt": "user-a"
}
object_group_id (default 0) and cache_salt (default "")
are optional for backward compatibility with older wire payloads. The
batch is capped at 10000 keys per request.
Response (200 OK):
{
"requested": 2,
"adapter": "S3L2Adapter",
"ok": true
}
On adapter-level failure the response is still 200 with
ok=false and an error field carrying the reason.
HTTP status codes:
200: request reached the adapter (checkokfor outcome).400: batch exceeds the limit, or a key payload violates anObjectKeyinvariant (bad hex,@inmodel_name, forbiddencache_saltcharacter).404:?adapter=<name>does not match any configured adapter.422: Pydantic-level body-shape failure (missingkeys, wrong field types).503: engine not initialized, or no L2 adapters configured.
Example:
curl -s -X DELETE http://localhost:8080/l2 \
-H 'Content-Type: application/json' \
-d '{
"keys": [
{"chunk_hash_hex": "aa", "model_name": "m",
"kv_rank": 0, "object_group_id": 0, "cache_salt": "user-a"}
]
}'
GET /l2/keys#
Paginate keys currently resident in one L2 adapter.
Query parameters:
Name |
Default |
Description |
|---|---|---|
|
primary |
|
|
none |
Restrict the result to keys whose |
|
|
Max entries per page. Must be in |
|
none |
Opaque cursor from the previous page’s |
The page token is private to the adapter; do not parse or modify it.
Adapters that support listing (currently only the S3 adapter via
ListObjectsV2) guarantee best-effort consistency, not snapshot
isolation — concurrent stores or deletes during a paginated walk may
cause keys to appear, disappear, or shift between pages.
Response (200 OK):
{
"adapter": "S3L2Adapter",
"entries": [
{
"key": {
"chunk_hash_hex": "abc123",
"model_name": "meta-llama/Llama-3-8B",
"kv_rank": 0,
"object_group_id": 0,
"cache_salt": "user-a"
},
"size_bytes": 4194304
}
],
"next_page_token": "opaque-cursor-string"
}
next_page_token is null when the listing is exhausted.
HTTP status codes:
200: success.400: malformedpage_token(adapter-level).404:?adapter=<name>does not match any configured adapter.422:page_sizeoutside[1, 5000].501: selected adapter does not implement listing. In v1 onlyS3L2Adapterdoes; adapters wrapped bySerdeL2AdapterWrapperinherit the wrapped adapter’s behavior.503: engine not initialized, or no L2 adapters configured.
Example: paginate every key for a model.
next=""
while :; do
page=$(curl -s "http://localhost:8080/l2/keys?model_name=meta-llama/Llama-3-8B&page_size=500&page_token=$next")
echo "$page" | jq '.entries[]'
next=$(echo "$page" | jq -r '.next_page_token // empty')
[ -z "$next" ] && break
done
POST /l2/prefetch#
Warm one node’s L1 by loading a token sequence’s chunks from L2 ahead of the requests that will use them, so the first request hits L1 instead of paying the L2 fetch inline. Useful when a workload is about to be routed to a node (a traffic shift, a hot shared system prompt).
The caller describes content by token ids, never by internal cache keys (a
key is a content hash plus a per-rank layout bitmap, which callers cannot
construct). The server hashes the tokens, expands each chunk across the node’s
ranks, and submits a warm prefetch: loaded chunks are retained
(permanent) and left unlocked — there is no downstream reader to pin them
for, so a later real lookup takes its own lock. The call returns immediately;
the load runs in the storage manager’s own thread. It coalesces across all
configured L2 adapters, so there is no ?adapter= selector.
Request body:
Field |
Type |
Description |
|---|---|---|
|
string |
The served model id, exactly as registered (e.g. |
|
int |
The value vLLM registered for the node’s KV layout and rank fan-out
( |
|
list[int] |
The prompt token ids. Must use the same tokenizer / special-token
settings as the store, and contain at least one full |
|
string |
Per-tenant key isolation; must match the store (default |
Response (202 Accepted):
{"request_id": "abc123", "chunks": 12, "status": "submitted"}
When the sequence is shorter than one chunk, nothing is submitted and there is
no request_id to poll:
{"chunks": 0, "status": "noop"}
HTTP status codes:
202: submitted (or anoopas above).400:token_idsexceeds the per-request cap, orcache_saltviolates its invariants.409: no layout registered for(model_name, world_size)— the model has not allocated KV cache on this node yet (start vLLM first).422: request body fails field-level validation.503: engine not initialized.
Example:
curl -s -X POST http://localhost:8080/l2/prefetch \
-H 'Content-Type: application/json' \
-d '{"model_name": "Qwen/Qwen3-8B", "world_size": 1,
"token_ids": [101, 102, 103], "cache_salt": "user-a"}'
# -> {"request_id": "abc123", "chunks": 1, "status": "submitted"}
GET /l2/prefetch/{request_id}#
Poll a submitted warm prefetch. The warm holds no lock, so the poll only
reports progress; the first poll that observes completion drops the job
(exactly-once), so a later poll for the same id returns 404.
Response (200 OK) while the load runs:
{"status": "pending"}
…and once complete:
{"status": "completed", "found_keys": 12, "total_keys": 12}
found_keys / total_keys count only chunks loaded from L2 by this
request; chunks already resident in L1 are skipped at reserve_write and not
counted, so a partially-resident warm undercounts by the resident chunk count
(a cold request loads and counts everything). The warm uses the gap-tolerant
SPARSE trim policy, so an already-resident chunk does not stop the rest from
loading — it loads every not-yet-resident chunk and reports that count. Not
counting the resident chunks is deliberate: an already-present entry may be a
transient temporary from another lookup, so claiming it as warmed could mislead.
HTTP status codes:
200: status reported (pendingorcompleted).404: unknownrequest_id— already completed-and-consumed, or never submitted.503: engine not initialized.
Example:
curl -s http://localhost:8080/l2/prefetch/abc123
# -> {"status": "completed", "found_keys": 1, "total_keys": 1}
Quota Management#
These endpoints manage the per-cache_salt storage budgets consumed by
the IsolatedLRU eviction policy (selected via
--eviction-policy IsolatedLRU). Quotas are soft: setting a limit
does not reject writes — any over-budget cache_salt is evicted at
the next eviction cycle (~1 s).
A cache_salt with no registered quota has an effective limit of
0 bytes, so its data is cleared next cycle (allowlist semantics).
These endpoints are no-ops on engines that did not start with
--eviction-policy IsolatedLRU: the QuotaManager is still
present, but the LRU policy ignores the registered quotas.
URL escaping for the empty salt. cache_salt="" (un-salted /
anonymous traffic) cannot appear in a URL path parameter, so the API
accepts the sentinel _default in its place. PUT /quota/_default
sets the quota for cache_salt="", and _default is echoed back in
responses for the empty salt. A user that legitimately stores data with
cache_salt="_default" cannot be managed via this HTTP API distinctly
from anonymous traffic — both map to the same path parameter; pick any
other value (e.g. "default") to disambiguate.
PUT /quota/{cache_salt}#
Create or update a quota.
Body: {"limit_gb": <float>} (required, finite, non-negative).
Response (200 OK):
{"cache_salt": "alice", "limit_gb": 10.0, "status": "ok"}
Errors: 400 for malformed JSON, missing limit_gb, non-numeric
limit_gb, nan / inf, or negative values; 503 if the
engine is not initialized.
Example:
curl -s -X PUT http://localhost:8080/quota/alice \
-H 'Content-Type: application/json' \
-d '{"limit_gb": 10.0}'
GET /quota/{cache_salt}#
Read the current quota and live usage for one cache_salt.
Response (200 OK):
{
"cache_salt": "alice",
"limit_gb": 10.0,
"current_usage_gb": 2.137,
"exists": true
}
exists is false when no quota was ever registered for this
cache_salt (limit_gb is then 0.0 and current_usage_gb
reflects whatever bytes are currently cached for that salt — those bytes
will evict next cycle under IsolatedLRU). This endpoint never returns
404 for an unknown salt.
DELETE /quota/{cache_salt}#
Remove a cache_salt’s quota entry. Any bytes still cached under this
cache_salt become over-budget on the next eviction cycle (effective
limit drops to 0) and will be evicted.
Response (200 OK):
{"cache_salt": "alice", "status": "removed"}
When no quota was registered for the given cache_salt, the response
is {"cache_salt": "...", "status": "not_found"} (still 200 OK).
GET /quota#
List every registered quota alongside its live usage.
Response (200 OK):
{
"users": {
"alice": {"limit_gb": 10.0, "current_usage_gb": 2.137},
"bob": {"limit_gb": 4.0, "current_usage_gb": 0.812}
}
}
Only cache_salt values with a registered quota appear; the empty
salt is reported under the _default key.
Runtime L2 Reconfiguration#
These endpoints are available when the server has a runtime-reconfigurable L2
adapter. They only change LMCache runtime mappings and metadata; backend
resources such as DAX device paths must already exist and be readable and
writable by the server. The endpoint routes backend, operation, and the
JSON request body into the generic L2 adapter reconfiguration API, while
backend-specific validation and migration semantics stay inside the adapter.
backend and operation path segments are normalized (stripped and
lower-cased). Within a request body, adapter_index (default 0) is
backend-local — it indexes only the adapters of that backend, not the
engine-wide adapter list. If an L2 adapter is wrapped by serde, the backend
string is still the configured L2 adapter type, not the serde wrapper type.
GET /reconfigure/backends#
List the backend strings that can be used in
/reconfigure/{backend}/status and
/reconfigure/{backend}/{operation}.
Response (200 OK):
{
"enabled": true,
"num_backends": 1,
"backends": ["dax"]
}
enabled is false (and backends empty) when no reconfigurable
adapter is present.
HTTP status codes: 200 on success; 503 if the engine is not
initialized.
GET /reconfigure/{backend}/status#
Report the runtime-manageable adapters for one backend type. Each adapter
entry’s adapter_index is rewritten to its backend-local 0-based index
(the value to pass back in operation request bodies).
Response (200 OK):
{
"enabled": true,
"backend": "dax",
"num_adapters": 1,
"adapters": [
{"adapter_index": 0, "...": "backend-specific adapter fields"}
]
}
An unknown or empty backend returns enabled=false, num_adapters=0,
adapters=[] (it is not a 404).
HTTP status codes: 200 on success; 400 if backend is empty;
503 if the engine is not initialized.
POST /reconfigure/{backend}/{operation}#
Apply one reconfiguration operation to a backend adapter. The request body is
a JSON object whose accepted fields depend on the backend and operation. The
200 response is whatever the storage manager’s
reconfigure_l2_adapter returns (a backend-defined dict).
For the generic path (any backend other than dax), the body carries
adapter_index plus any backend-specific fields, which are forwarded
verbatim to the adapter.
For Device-DAX (backend=dax), JSON request bodies are used because DAX
paths contain slashes. The accepted operations and fields are:
Operation |
Body fields |
|---|---|
|
|
|
|
|
|
size accepts an integer byte count or a string with a base-1024 unit
suffix (b, kib, mib, gib, tib and the k/m/g/t
aliases), e.g. "100GiB"; it must resolve to a positive value.
HTTP status codes:
200: success (body is the storage manager’s reconfigure result).400: emptybackend/operation, an unsupported DAX operation, or an invalidsize.404:adapter_indexis out of range for the backend.422: request body fails validation (e.g. a missing required field, or an unknown field in a DAX body — DAX bodies reject extras).503: engine not initialized.
See Device-DAX (/dev/dax) for detailed request examples, mode semantics, and validation guidance.
Observability#
GET /metrics#
Prometheus exposition format for every metric registered on the default
prometheus_client registry (Content-Type: text/plain). Scrape this
directly from Prometheus. See Observability for the list of
exported metrics.
Example:
curl -s http://localhost:8080/metrics
POST /metrics/reset#
Resets all LMCache observability metrics to their initial state
(reset_observability_metrics). Intended for test harnesses and
benchmarks — not for production.
Response (200 OK, text/plain):
ok
Example:
curl -s -X POST http://localhost:8080/metrics/reset
Diagnostics and Debugging#
GET /loglevel#
Inspect or mutate Python logger levels at runtime. All responses are
text/plain. The endpoint has three modes driven by query parameters:
Query |
Behavior |
|---|---|
(no params) |
List every logger registered with |
|
Return the effective level of the named logger. |
|
Set the named logger (and its handlers) to |
Passing level without logger_name matches none of the modes and
returns 200 with a null body.
Examples:
# list everything
curl -s http://localhost:8080/loglevel
# read one
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache'
# elevate to DEBUG
curl -s 'http://localhost:8080/loglevel?logger_name=lmcache&level=DEBUG'
GET /threads#
Enumerate active Python threads in the server process along with their
stack traces, plus a total-count summary (Content-Type: text/plain).
Useful for live debugging of hangs or runaway workers.
Query |
Behavior |
|---|---|
|
Keep only threads whose name contains |
|
Keep only the thread with the matching |
Warning
The response contains live stack traces and can disclose internal code paths and state. Restrict network access to this endpoint in production.
Example:
curl -s 'http://localhost:8080/threads?name=periodic'
GET /periodic-threads#
Returns a JSON snapshot of the
PeriodicThreadRegistry: counts by
level plus per-thread status (last run timestamp, latest summary, etc.).
Query |
Behavior |
|---|---|
|
Only include threads at the given level. |
|
Only include threads currently running. |
|
Only include threads considered active (recent tick). |
Response (200 OK):
{
"summary": {
"total_count": 4,
"running_count": 4,
"active_count": 4,
"by_level": {
"critical": {"total": 1, "running": 1, "active": 1},
"high": {"total": 2, "running": 2, "active": 2},
"medium": {"total": 1, "running": 1, "active": 1},
"low": {"total": 0, "running": 0, "active": 0}
}
},
"threads": [
{
"name": "...",
"level": "high",
"interval": 5.0,
"is_running": true,
"is_active": true,
"last_run_ago": 1.2,
"total_runs": 120,
"failed_runs": 0,
"success_rate": 100.0,
"last_summary": {"...": "..."}
}
]
}
Example:
curl -s 'http://localhost:8080/periodic-threads?level=critical' | jq
GET /periodic-threads/{thread_name}#
Detailed status for a single periodic thread (the same per-thread object
shown in the threads list above). Returns 404 with
{"error": "Thread not found: <name>"} if the name is unknown.
Example:
curl -s http://localhost:8080/periodic-threads/storage-flush | jq
GET /periodic-threads-health#
Fast health check covering only critical and high level periodic
threads. A thread is flagged unhealthy when it is marked running but has
not ticked within its expected interval. Always returns 200 — health
is conveyed by the healthy boolean, not the HTTP status.
Response (200 OK):
{
"healthy": true,
"unhealthy_count": 0,
"unhealthy_threads": []
}
When something is lagging:
{
"healthy": false,
"unhealthy_count": 1,
"unhealthy_threads": [
{
"name": "storage-flush",
"level": "critical",
"last_run_ago": 42.5,
"interval": 5.0
}
]
}
Example:
curl -s http://localhost:8080/periodic-threads-health
GET /env#
Dumps the process environment variables as a sorted, pretty-printed
JSON document. The response Content-Type is text/plain so it can be
piped directly to a terminal.
Warning
The payload contains every environment variable, including any secrets injected via the environment. There is no redaction or auth — restrict network access to this endpoint in production.
Example:
curl -s http://localhost:8080/env
POST /run_script#
Execute an uploaded Python script inside the server process. The script is
uploaded as multipart form data under the field name script and is
exec’d with a restricted __builtins__ (only print, str,
int, float, list, dict, tuple, set, and a guarded
__import__). Only modules listed in --script-allowed-imports can be
imported; the running FastAPI app is injected into the script globals.
If the script assigns a variable named result, its stringified value is
returned; otherwise the body is "Script executed successfully"
(Content-Type: text/plain).
Danger
This endpoint runs caller-supplied code in-process. The restricted
builtins are not a security sandbox — combined with the injected
app object and any allowed imports, treat it as full remote code
execution. Never expose it on an untrusted network.
HTTP status codes:
200: script executed.400: noscriptfile provided.500: an exception was raised during import setup or execution (body:"Error executing script: <reason>").
Example:
curl -s -X POST http://localhost:8080/run_script \
-F 'script=@my_script.py'
Adding New Endpoints#
Endpoints are auto-discovered from
lmcache/v1/multiprocess/http_apis/. To add a new MP-only endpoint:
Create a new module in that directory named
<name>_api.py.Define a module-level
router = APIRouter().Register handlers on
routerusing FastAPI decorators.Access the engine via
request.app.state.engineand guard for theNonecase (engine not yet initialized).
The HTTPAPIRegistry
will pick the module up automatically at startup — no central
registration list to edit.
If the route is generic enough to be shared with the vLLM-embedded API
server, add it under lmcache/v1/internal_api_server/common/ instead.
It will be picked up on the MP side via common_api.py unless its
module name is listed in _MP_INCOMPATIBLE_MODULES there (reserved
for modules that require vLLM-specific app.state attributes; the
list is currently empty). A handler that lives under
internal_api_server/vllm/ can still be surfaced on the MP server by
adding a thin re-export shim under http_apis/ (as
version_api.py does for the version endpoints).
When adding a new endpoint, please also add a matching section to this
page documenting the endpoint’s purpose, request/response schema, and
an example curl invocation.