vLLM / Inference APIs#
These APIs are specific to vLLM inference workers and provide cache management, configuration, freeze control, chunk statistics, and version information.
Version & Info#
GET /lmc_version — LMCache Version#
Get the LMCache library version string.
Method: GET
Path: /lmc_version
Parameters: None
Response: Plain version string.
curl http://localhost:7000/lmc_version
GET /commit_id — LMCache Commit ID#
Get the LMCache git commit ID.
Method: GET
Path: /commit_id
Parameters: None
Response: Plain commit ID string.
curl http://localhost:7000/commit_id
GET /version — Full Version Info#
Get full version info (version + commit ID).
Method: GET
Path: /version
Parameters: None
Response: Combined version string.
curl http://localhost:7000/version
GET /inference_info — Inference Information#
Get inference information including vLLM config and LMCache details.
Method: GET
Path: /inference_info
Parameters:
  format (str): (Optional) Reserved for future use
Response: application/json
curl http://localhost:7000/inference_info
Error Response (HTTP 500):
{
"error": "Failed to get inference info",
"message": "..."
}
GET /inference_version — vLLM Version#
Get the vLLM version information.
Method: GET
Path: /inference_version
Parameters: None
Response: application/json
curl http://localhost:7000/inference_version
Example Response:
{
"vllm_version": "0.8.0"
}
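If you are checking these endpoints from a script rather than curl, a minimal Python sketch might look like the following (it assumes the requests library and the same http://localhost:7000 address used in the examples above):

import requests

BASE = "http://localhost:7000"  # assumed worker address, as in the curl examples

# /lmc_version and /commit_id return plain strings
lmc_version = requests.get(f"{BASE}/lmc_version", timeout=5).text.strip()
commit_id = requests.get(f"{BASE}/commit_id", timeout=5).text.strip()

# /inference_version returns JSON, e.g. {"vllm_version": "0.8.0"}
vllm = requests.get(f"{BASE}/inference_version", timeout=5).json()

print(f"LMCache {lmc_version} ({commit_id}), vLLM {vllm.get('vllm_version')}")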
Configuration & Metadata#
GET /conf — Get Configuration#
Get current LMCache engine configuration values.
Method: GET
Path: /conf
Parameters:
  names (str): (Optional) Comma-separated list of config names to filter
Response: application/json — JSON object of configuration key-value pairs.
# Get all config
curl http://localhost:7000/conf
# Get specific config values
curl "http://localhost:7000/conf?names=min_retrieve_tokens,save_decode_cache"
POST /conf — Update Configuration (Experimental)#
Update one or more configuration values at runtime.
Warning
This feature is currently experimental. All configuration keys are
mutable at runtime by default unless explicitly marked as
"mutable": False in _CONFIG_DEFINITIONS. The default will be
changed to immutable once the feature is stabilized.
Updating a configuration only modifies the value in the
LMCacheEngineConfig object. If a component has already cached the
value elsewhere, the change will not take effect for that component.
Method: POST
Path: /conf
Content-Type: application/json
Request Body: JSON object with config name-value pairs.
Response: application/json
curl -X POST http://localhost:7000/conf \
-H "Content-Type: application/json" \
-d '{"min_retrieve_tokens": 512, "save_decode_cache": true}'
Example Response (HTTP 200):
{
"updated": {
"min_retrieve_tokens": 512,
"save_decode_cache": true
}
}
Example Response (partial failure, HTTP 400):
{
"updated": {"min_retrieve_tokens": 512},
"errors": {"unknown_key": "Unknown config"}
}
Error Cases:
  Unknown config key → "Unknown config"
  Immutable config key → "Config is not mutable at runtime"
  Invalid JSON body → HTTP 400
GET /meta — Engine Metadata#
Get metadata of the LMCache engine (e.g., worker_id, model_name, kv_shape).
Method: GET
Path: /meta
Parameters:
  names (str): (Optional) Comma-separated list of attribute names to filter
Response: application/json — JSON object of metadata attributes.
# Get all metadata
curl http://localhost:7000/meta
# Get specific attributes
curl "http://localhost:7000/meta?names=worker_id,model_name"
Cache Operations#
DELETE /cache/clear — Clear Cache#
Clear cached KV data from the LMCache engine.
Method: DELETE
Path: /cache/clear
Parameters:
  locations (list[str]): (Optional) Storage backends to clear (e.g. LocalCPUBackend, LocalDiskBackend). If not specified, clears all.
Response: application/json
# Clear all cache
curl -X DELETE http://localhost:7000/cache/clear
# Clear specific backends
curl -X DELETE "http://localhost:7000/cache/clear?locations=LocalCPUBackend&locations=LocalDiskBackend"
Example Response:
{
"status": "success",
"num_removed": 10
}
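The locations parameter is repeated once per backend in the query string. A minimal Python sketch of the same calls (requests library, assumed localhost address; passing a list of key/value pairs produces the repeated parameter):

import requests

BASE = "http://localhost:7000"  # assumed worker address

# Clear everything
requests.delete(f"{BASE}/cache/clear", timeout=30)

# Clear only specific backends: a list of pairs yields repeated "locations" params
resp = requests.delete(
    f"{BASE}/cache/clear",
    params=[("locations", "LocalCPUBackend"), ("locations", "LocalDiskBackend")],
    timeout=30,
)
print(resp.json())  # e.g. {"status": "success", "num_removed": 10}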
POST /cache/store — Store KV Cache#
Store KV cache data into the LMCache engine using mock tokens.
Method: POST
Path: /cache/store
Parameters:
  tokens_mock (str): Two comma-separated numbers, "start,end" (e.g. "0,100" generates tokens [0..99])
Response: application/json
curl -X POST "http://localhost:7000/cache/store?tokens_mock=0,100"
Example Response:
{
"status": "success",
"num_tokens": 100
}
Error Response (missing params, HTTP 400):
{
"error": "Missing parameters",
"message": "Must specify either tokens_input or tokens_mock"
}
POST /cache/retrieve — Retrieve KV Cache#
Retrieve KV cache data from the LMCache engine using mock tokens.
Method: POST
Path: /cache/retrieve
Parameters:
  tokens_mock (str): Two comma-separated numbers, "start,end" (e.g. "0,100")
Response: application/json
curl -X POST "http://localhost:7000/cache/retrieve?tokens_mock=0,100"
Example Response:
{
"status": "success",
"num_tokens": 100,
"num_retrieved": 80
}
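The store and retrieve endpoints are typically used together to exercise the cache path with mock tokens. A minimal round-trip sketch in Python (requests library, assumed localhost address, and the "0,100" range from the examples):

import requests

BASE = "http://localhost:7000"  # assumed worker address

store = requests.post(f"{BASE}/cache/store", params={"tokens_mock": "0,100"}, timeout=60).json()
retrieved = requests.post(f"{BASE}/cache/retrieve", params={"tokens_mock": "0,100"}, timeout=60).json()

print(f"stored {store['num_tokens']} tokens, retrieved {retrieved['num_retrieved']}")
if retrieved["num_retrieved"] < retrieved["num_tokens"]:
    print("warning: part of the range was not found in the cache")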
GET /cache/kvcache/check — KVCache Checksum#
Compute MD5 checksums for kvcaches at specified slot_mapping positions. Used for verifying that stored and retrieved kvcaches are identical.
Method: GET
Path: /cache/kvcache/check
Parameters:
  slot_mapping (str): Slot indices, comma-separated. Supports ranges: "0,1,2,3" or "1,2,3,[9,12],17,19"
  chunk_size (int): Chunk size for computing per-chunk checksums (required)
  layerwise (bool): If true, output per-layer checksums per chunk (default: false)
Response: application/json
# Per-chunk checksum (all layers combined)
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2"
# Per-layer per-chunk checksum
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2&layerwise=true"
Example Response (layerwise=false):
{
"status": "success",
"slot_mapping_ranges": [[0, 3]],
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": ["abc123...", "def456..."],
"layerwise": false
}
Example Response (layerwise=true):
{
"status": "success",
"slot_mapping_ranges": [[0, 3]],
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": {
"layer_0": ["abc123...", "def456..."],
"layer_1": ["ghi789...", "jkl012..."]
},
"layerwise": true
}
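Checksums from two invocations (for example, right after a store and again after a retrieve, or on two workers in a disaggregated setup) can be compared chunk by chunk. An illustrative sketch, assuming the requests library, a local worker, and the same slot_mapping/chunk_size values as above:

import requests

BASE = "http://localhost:7000"  # assumed worker address
params = {"slot_mapping": "0,1,2,3", "chunk_size": 2}

def chunk_checksums():
    return requests.get(f"{BASE}/cache/kvcache/check", params=params, timeout=30).json()["chunk_checksums"]

before = chunk_checksums()  # e.g. taken right after a store
after = chunk_checksums()   # e.g. taken again after a retrieve

for i, (a, b) in enumerate(zip(before, after)):
    print(f"chunk {i}: {'OK' if a == b else 'MISMATCH'}")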
POST /cache/kvcache/record_slot — Toggle Slot Logging#
Enable or disable KVCache slot_mapping logging during store/retrieve operations.
Method: POST
Path: /cache/kvcache/record_slot
Parameters:
  enabled (str): "true" to enable, "false" to disable. Omit to query the current status.
Response: application/json
# Enable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=true"
# Disable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=false"
# Check current status
curl -X POST http://localhost:7000/cache/kvcache/record_slot
Example Response:
{
"status": "success",
"kvcache_check_log_enabled": true
}
GET /cache/kvcache/info — KVCache Information#
Get information about the current kvcaches structure including layer names, shapes, and device info.
Method: GET
Path: /cache/kvcache/info
Parameters: None
Response: application/json
curl http://localhost:7000/cache/kvcache/info
Example Response:
{
"status": "success",
"num_layers": 32,
"layers": {
"layer_0": {
"shape": [2, 128, 16, 64, 128],
"dtype": "torch.bfloat16",
"device": "cuda:0"
}
}
}
POST /cache/load-fs-chunks — Load FS Chunks#
Load chunk files from FSConnector storage into LocalCPUBackend’s hot cache.
Method: POST
Path: /cache/load-fs-chunks
Content-Type: application/json
Request Body:
  config_path (str): Path to the LMCache engine configuration YAML file (required)
  max_chunks (int): (Optional) Maximum number of chunks to load
  max_failed_keys (int): Maximum number of failed keys to report (default: 10)
Response: application/json
Tags: cache-management
curl -X POST http://localhost:7000/cache/load-fs-chunks \
-H "Content-Type: application/json" \
-d '{"config_path": "/path/to/lmcache.yaml", "max_chunks": 100}'
Example Response (HTTP 200):
{
"status": "success",
"loaded_chunks": 95,
"total_files": 100,
"failed_keys": ["key1", "key2"],
"config_path": "/path/to/lmcache.yaml"
}
Error Response (invalid config, HTTP 400):
{
"error": "Failed to load chunks from FSConnector",
"message": "Configuration file not found",
"config_path": "/path/to/lmcache.yaml"
}
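A scripted call needs a JSON body rather than query parameters. A minimal Python sketch (requests library, assumed localhost address, placeholder config path from the examples above):

import requests

BASE = "http://localhost:7000"    # assumed worker address
CONFIG = "/path/to/lmcache.yaml"  # placeholder path, as in the curl example

resp = requests.post(
    f"{BASE}/cache/load-fs-chunks",
    json={"config_path": CONFIG, "max_chunks": 100},
    timeout=600,  # loading many chunk files can take a while
)
body = resp.json()
if resp.ok:
    print(f"loaded {body['loaded_chunks']} of {body['total_files']} chunk files")
    if body.get("failed_keys"):
        print("failed keys:", body["failed_keys"])
else:
    print("load failed:", body.get("message"))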
Freeze Mode#
PUT /freeze/enable — Enable Freeze Mode#
Enable freeze mode for the LMCache engine. When enabled:
- All store operations will be skipped (no new data stored)
- Only the local_cpu backend will be used for retrieval
- No admit/evict messages will be generated
This protects the local_cpu hot cache from changes.
Method: PUT
Path: /freeze/enable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/freeze/enable
Example Response:
{
"status": "success",
"freeze": true,
"message": "Freeze mode enabled successfully"
}
PUT /freeze/disable — Disable Freeze Mode#
Disable freeze mode. Store operations will proceed normally.
Method: PUT
Path: /freeze/disable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/freeze/disable
Example Response:
{
"status": "success",
"freeze": false,
"message": "Freeze mode disabled successfully"
}
GET /freeze/status — Freeze Status#
Get the current freeze mode status.
Method: GET
Path: /freeze/status
Parameters: None
Response: application/json
curl http://localhost:7000/freeze/status
Example Response:
{
"status": "success",
"freeze": true,
"message": "Freeze mode is enabled"
}
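A common pattern is to freeze the engine around a maintenance window so the local_cpu hot cache is not modified, then unfreeze afterwards. An illustrative sketch (requests library, assumed localhost address):

import requests

BASE = "http://localhost:7000"  # assumed worker address

requests.put(f"{BASE}/freeze/enable", timeout=5)
assert requests.get(f"{BASE}/freeze/status", timeout=5).json()["freeze"] is True

# ... maintenance work goes here (the hot cache is protected while frozen) ...

requests.put(f"{BASE}/freeze/disable", timeout=5)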
Chunk Statistics#
These endpoints manage chunk-level statistics collection via
ChunkStatisticsLookupClient. They are only available when the
lookup client supports statistics.
POST /chunk_statistics/start — Start Statistics#
Start collecting chunk statistics.
Method: POST
Path: /chunk_statistics/start
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/start
Example Response:
{
"status": "success",
"message": "Started"
}
Error Response (not supported, HTTP 400):
{
"error": "Not available",
"message": "Client does not support statistics."
}
POST /chunk_statistics/stop — Stop Statistics#
Stop collecting chunk statistics.
Method: POST
Path: /chunk_statistics/stop
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/stop
Example Response:
{
"status": "success",
"message": "Stopped"
}
POST /chunk_statistics/reset — Reset Statistics#
Reset all collected chunk statistics.
Method: POST
Path: /chunk_statistics/reset
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/reset
Example Response:
{
"status": "success",
"message": "Reset"
}
GET /chunk_statistics/status — Statistics Status#
Get current chunk statistics and auto-exit configuration.
Method: GET
Path: /chunk_statistics/status
Parameters: None
Response: application/json
curl http://localhost:7000/chunk_statistics/status
Example Response:
{
"is_collecting": true,
"total_chunks": 1000,
"unique_chunks": 500,
"timestamp": 1706745600.0,
"auto_exit_enabled": true,
"auto_exit_timeout_hours": 24.0,
"auto_exit_target_unique_chunks": 1000
}
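Start, poll, and stop can be combined into a small collection script. An illustrative sketch (requests library, assumed localhost address; the stopping rule of waiting until the unique-chunk count stops growing is just an example):

import time
import requests

BASE = "http://localhost:7000"  # assumed worker address

requests.post(f"{BASE}/chunk_statistics/start", timeout=5)

previous_unique = -1
while True:
    status = requests.get(f"{BASE}/chunk_statistics/status", timeout=5).json()
    print(f"total={status['total_chunks']} unique={status['unique_chunks']}")
    if status["unique_chunks"] == previous_unique:
        break
    previous_unique = status["unique_chunks"]
    time.sleep(60)

requests.post(f"{BASE}/chunk_statistics/stop", timeout=5)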