vLLM / Inference APIs#
These APIs are specific to vLLM inference workers and provide cache management, configuration, freeze control, chunk statistics, and version information.
Version & Info#
GET /lmc_version — LMCache Version#
Get the LMCache library version string.
Method: GET
Path: /lmc_version
Parameters: None
Response: Plain version string.
curl http://localhost:7000/lmc_version
GET /commit_id — LMCache Commit ID#
Get the LMCache git commit ID.
Method: GET
Path: /commit_id
Parameters: None
Response: Plain commit ID string.
curl http://localhost:7000/commit_id
GET /version — Full Version Info#
Get full version info (version + commit ID).
Method: GET
Path: /version
Parameters: None
Response: Combined version string.
curl http://localhost:7000/version
GET /inference_info — Inference Information#
Get inference information including vLLM config and LMCache details.
Method: GET
Path: /inference_info
Parameters:
- format (str): (Optional) Reserved for future use
Response: application/json
curl http://localhost:7000/inference_info
Error Response (HTTP 500):
{
"error": "Failed to get inference info",
"message": "..."
}
GET /inference_version — vLLM Version#
Get the vLLM version information.
Method: GET
Path: /inference_version
Parameters: None
Response: application/json
curl http://localhost:7000/inference_version
Example Response:
{
"vllm_version": "0.8.0"
}
Configuration & Metadata#
GET /conf — Get Configuration#
Get current LMCache engine configuration values.
Method: GET
Path: /conf
Parameters:
- names (str): (Optional) Comma-separated list of config names to filter
Response: application/json — JSON object of configuration key-value pairs.
# Get all config
curl http://localhost:7000/conf
# Get specific config values
curl "http://localhost:7000/conf?names=min_retrieve_tokens,save_decode_cache"
POST /conf — Update Configuration (Experimental)#
Update one or more configuration values at runtime.
Warning
This feature is currently experimental. All configuration keys are
mutable at runtime by default unless explicitly marked as
"mutable": False in _CONFIG_DEFINITIONS. The default will be
changed to immutable once the feature is stabilized.
Updating a configuration only modifies the value in the
LMCacheEngineConfig object. If a component has already cached the
value elsewhere, the change will not take effect for that component.
Method: POST
Path: /conf
Content-Type: application/json
Request Body: JSON object with config name-value pairs.
Response: application/json
curl -X POST http://localhost:7000/conf \
-H "Content-Type: application/json" \
-d '{"min_retrieve_tokens": 512, "save_decode_cache": true}'
Example Response (HTTP 200):
{
"updated": {
"min_retrieve_tokens": 512,
"save_decode_cache": true
}
}
Example Response (partial failure, HTTP 400):
{
"updated": {"min_retrieve_tokens": 512},
"errors": {"unknown_key": "Unknown config"}
}
Error Cases:
- Unknown config key → "Unknown config"
- Immutable config key → "Config is not mutable at runtime"
- Invalid JSON body → HTTP 400
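The partial-success semantics above (mutable keys applied, per-key errors reported, any failure turning the request into a 400) can be sketched in Python. This is a simplified illustration of the documented behavior, not LMCache's implementation; the `_CONFIG_DEFINITIONS` contents here are invented for the example.

```python
# Sketch of the POST /conf update semantics described above.
# _CONFIG_DEFINITIONS entries are hypothetical stand-ins, not LMCache's real table.
_CONFIG_DEFINITIONS = {
    "min_retrieve_tokens": {"mutable": True},
    "save_decode_cache": {"mutable": True},
    "chunk_size": {"mutable": False},  # example of an immutable key
}

def update_conf(config: dict, updates: dict) -> tuple[int, dict]:
    """Apply runtime updates; return (http_status, response_body)."""
    updated, errors = {}, {}
    for key, value in updates.items():
        definition = _CONFIG_DEFINITIONS.get(key)
        if definition is None:
            errors[key] = "Unknown config"
        elif not definition.get("mutable", True):
            errors[key] = "Config is not mutable at runtime"
        else:
            config[key] = value
            updated[key] = value
    body = {"updated": updated}
    if errors:
        body["errors"] = errors
    # Any per-key failure makes the whole request a 400, as in the example above.
    return (400 if errors else 200), body
```

Note that, as the warning above states, a successful update only changes the `LMCacheEngineConfig` object; components that cached the old value elsewhere are unaffected.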
GET /meta — Engine Metadata#
Get metadata of the LMCache engine (e.g., worker_id, model_name, kv_shape).
Method: GET
Path: /meta
Parameters:
- names (str): (Optional) Comma-separated list of attribute names to filter
Response: application/json — JSON object of metadata attributes.
# Get all metadata
curl http://localhost:7000/meta
# Get specific attributes
curl "http://localhost:7000/meta?names=worker_id,model_name"
Cache Operations#
DELETE /cache/clear — Clear Cache#
Clear cached KV data from the LMCache engine.
Method: DELETE
Path: /cache/clear
Parameters:
- locations (list[str]): (Optional) Storage backends to clear (e.g. LocalCPUBackend, LocalDiskBackend). If not specified, clears all.
Response: application/json
# Clear all cache
curl -X DELETE http://localhost:7000/cache/clear
# Clear specific backends
curl -X DELETE "http://localhost:7000/cache/clear?locations=LocalCPUBackend&locations=LocalDiskBackend"
Example Response:
{
"status": "success",
"num_removed": 10
}
POST /cache/store — Store KV Cache#
Store KV cache data into the LMCache engine using mock tokens.
Method: POST
Path: /cache/store
Parameters:
- tokens_mock (str): Two comma-separated numbers: "start,end" (e.g. "0,100" generates tokens [0..99])
Response: application/json
curl -X POST "http://localhost:7000/cache/store?tokens_mock=0,100"
Example Response:
{
"status": "success",
"num_tokens": 100
}
Error Response (missing params, HTTP 400):
{
"error": "Missing parameters",
"message": "Must specify either tokens_input or tokens_mock"
}
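Following the parameter description above, `tokens_mock="0,100"` expands to the half-open range of token ids [0..99]. A minimal sketch of that expansion (the function name is ours, not LMCache's):

```python
def parse_tokens_mock(tokens_mock: str) -> list[int]:
    """Expand "start,end" into the mock token ids [start..end-1]."""
    start, end = (int(part) for part in tokens_mock.split(","))
    return list(range(start, end))
```

So `parse_tokens_mock("0,100")` yields 100 token ids, matching `"num_tokens": 100` in the example response.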
POST /cache/retrieve — Retrieve KV Cache#
Retrieve KV cache data from the LMCache engine using mock tokens.
Method: POST
Path: /cache/retrieve
Parameters:
- tokens_mock (str): Two comma-separated numbers: "start,end" (e.g. "0,100")
Response: application/json
curl -X POST "http://localhost:7000/cache/retrieve?tokens_mock=0,100"
Example Response:
{
"status": "success",
"num_tokens": 100,
"num_retrieved": 80
}
GET /cache/kvcache/check — KVCache Checksum#
Compute MD5 checksums for kvcaches at specified slot_mapping positions. Used for verifying that stored and retrieved kvcaches are identical.
Method: GET
Path: /cache/kvcache/check
Parameters:
- slot_mapping (str): Slot indices, comma-separated. Supports ranges: "0,1,2,3" or "1,2,3,[9,12],17,19"
- chunk_size (int): Chunk size for computing per-chunk checksums (required)
- layerwise (bool): If true, output per-layer checksums per chunk (default: false)
Response: application/json
# Per-chunk checksum (all layers combined)
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2"
# Per-layer per-chunk checksum
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2&layerwise=true"
Example Response (layerwise=false):
{
"status": "success",
"slot_mapping_ranges": [[0, 3]],
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": ["abc123...", "def456..."],
"layerwise": false
}
Example Response (layerwise=true):
{
"status": "success",
"slot_mapping_ranges": [[0, 3]],
"chunk_size": 2,
"num_chunks": 2,
"chunk_checksums": {
"layer_0": ["abc123...", "def456..."],
"layer_1": ["ghi789...", "jkl012..."]
},
"layerwise": true
}
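A sketch of how the `slot_mapping` string might expand and how per-chunk MD5 checksums could be grouped, under two stated assumptions: that `[a,b]` denotes an inclusive range, and that each slot's KV data can be treated as bytes (here a plain `dict` stands in for the real GPU tensors). Function names are ours, not LMCache's:

```python
import hashlib
import re

def parse_slot_mapping(spec: str) -> list[int]:
    """Expand e.g. "1,2,3,[9,12],17,19" into slot indices.

    Assumes [a,b] is an inclusive range, so [9,12] -> 9,10,11,12.
    """
    slots: list[int] = []
    for token in re.findall(r"\[\d+,\d+\]|\d+", spec):
        if token.startswith("["):
            a, b = (int(x) for x in token[1:-1].split(","))
            slots.extend(range(a, b + 1))
        else:
            slots.append(int(token))
    return slots

def chunk_checksums(slots: list[int],
                    kv_bytes: dict[int, bytes],
                    chunk_size: int) -> list[str]:
    """MD5 over each chunk_size-sized group of per-slot KV bytes."""
    checksums = []
    for i in range(0, len(slots), chunk_size):
        digest = hashlib.md5()
        for slot in slots[i:i + chunk_size]:
            digest.update(kv_bytes[slot])
        checksums.append(digest.hexdigest())
    return checksums
```

Because the checksums are deterministic, comparing the output of this endpoint before a store and after a retrieve verifies that the kvcaches are byte-identical, as the description above intends.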
POST /cache/kvcache/record_slot — Toggle Slot Logging#
Enable or disable KVCache slot_mapping logging during store/retrieve operations.
Method: POST
Path: /cache/kvcache/record_slot
Parameters:
- enabled (str): "true" to enable, "false" to disable. Omit to query current status.
Response: application/json
# Enable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=true"
# Disable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=false"
# Check current status
curl -X POST http://localhost:7000/cache/kvcache/record_slot
Example Response:
{
"status": "success",
"kvcache_check_log_enabled": true
}
GET /cache/kvcache/info — KVCache Information#
Get information about the current kvcaches structure including layer names, shapes, and device info.
Method: GET
Path: /cache/kvcache/info
Parameters: None
Response: application/json
curl http://localhost:7000/cache/kvcache/info
Example Response:
{
"status": "success",
"num_layers": 32,
"layers": {
"layer_0": {
"shape": [2, 128, 16, 64, 128],
"dtype": "torch.bfloat16",
"device": "cuda:0"
}
}
}
POST /cache/load-fs-chunks — Load FS Chunks#
Load chunk files from FSConnector storage into LocalCPUBackend’s hot cache.
Method: POST
Path: /cache/load-fs-chunks
Content-Type: application/json
Request Body:
- config_path (str): Path to LMCache engine configuration YAML file (required)
- max_chunks (int): (Optional) Maximum number of chunks to load
- max_failed_keys (int): Maximum failed keys to report (default: 10)
Response: application/json
Tags: cache-management
curl -X POST http://localhost:7000/cache/load-fs-chunks \
-H "Content-Type: application/json" \
-d '{"config_path": "/path/to/lmcache.yaml", "max_chunks": 100}'
Example Response (HTTP 200):
{
"status": "success",
"loaded_chunks": 95,
"total_files": 100,
"failed_keys": ["key1", "key2"],
"config_path": "/path/to/lmcache.yaml"
}
Error Response (invalid config, HTTP 400):
{
"error": "Failed to load chunks from FSConnector",
"message": "Configuration file not found",
"config_path": "/path/to/lmcache.yaml"
}
Freeze Mode#
PUT /freeze/enable — Enable Freeze Mode#
Enable freeze mode for the LMCache engine. When enabled:
- All store operations will be skipped (no new data stored)
- Only the local_cpu backend will be used for retrieval
- No admit/evict messages will be generated
This protects the local_cpu hot cache from changes.
Method: PUT
Path: /freeze/enable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/freeze/enable
Example Response:
{
"status": "success",
"freeze": true,
"message": "Freeze mode enabled successfully"
}
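The freeze semantics described above amount to two checks on the engine's hot path. A hypothetical sketch (class and method names are ours, not LMCache's):

```python
class FreezeGate:
    """Hypothetical sketch of the freeze-mode checks described above."""

    def __init__(self) -> None:
        self.freeze = False  # toggled by PUT /freeze/enable and /freeze/disable

    def should_store(self) -> bool:
        # In freeze mode every store is skipped, so no new data is written
        # and no admit/evict messages are generated.
        return not self.freeze

    def retrieval_backends(self, backends: list[str]) -> list[str]:
        # Frozen: retrieve only from the local_cpu backend.
        if self.freeze:
            return [b for b in backends if b == "local_cpu"]
        return backends
```

This is why freeze mode protects the local_cpu hot cache: nothing can be added to it, and retrieval never touches other backends.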
PUT /freeze/disable — Disable Freeze Mode#
Disable freeze mode. Store operations will proceed normally.
Method: PUT
Path: /freeze/disable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/freeze/disable
Example Response:
{
"status": "success",
"freeze": false,
"message": "Freeze mode disabled successfully"
}
GET /freeze/status — Freeze Status#
Get the current freeze mode status.
Method: GET
Path: /freeze/status
Parameters: None
Response: application/json
curl http://localhost:7000/freeze/status
Example Response:
{
"status": "success",
"freeze": true,
"message": "Freeze mode is enabled"
}
Hot Cache#
These endpoints control the hot cache feature of LocalCPUBackend. When hot cache is enabled, frequently accessed KV cache data will be kept in CPU memory for faster retrieval.
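To make the "frequently accessed data kept in CPU memory" behavior concrete, here is a minimal LRU-style sketch. It is a hypothetical illustration of the described semantics (including the disable behavior: existing entries cleared, no new writes), not LMCache's LocalCPUBackend code; all names and the eviction policy are assumptions.

```python
from collections import OrderedDict

class HotCache:
    """Hypothetical LRU sketch of the hot-cache behavior described above."""

    def __init__(self, capacity: int = 4) -> None:
        self.enabled = True
        self.capacity = capacity
        self._entries: OrderedDict[str, bytes] = OrderedDict()

    def put(self, key: str, value: bytes) -> None:
        if not self.enabled:
            return  # disabled: no new data is written
        self._entries[key] = value
        self._entries.move_to_end(key)  # mark as most recently used
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, key: str):
        if self.enabled and key in self._entries:
            self._entries.move_to_end(key)  # a hit keeps the entry hot
            return self._entries[key]
        return None

    def disable(self) -> None:
        # PUT /hot_cache/disable: clear existing entries and stop writes.
        self.enabled = False
        self._entries.clear()
```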
PUT /hot_cache/enable — Enable Hot Cache#
Enable hot cache for the LocalCPUBackend.
Method: PUT
Path: /hot_cache/enable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/hot_cache/enable
Example Response:
{
"status": "success",
"hot_cache": true,
"message": "Hot cache enabled successfully"
}
PUT /hot_cache/disable — Disable Hot Cache#
Disable hot cache for the LocalCPUBackend. Existing hot cache entries will be cleared and no new data will be written.
Method: PUT
Path: /hot_cache/disable
Parameters: None
Response: application/json
curl -X PUT http://localhost:7000/hot_cache/disable
Example Response:
{
"status": "success",
"hot_cache": false,
"message": "Hot cache disabled successfully"
}
GET /hot_cache/status — Hot Cache Status#
Get the current hot cache status of LocalCPUBackend.
Method: GET
Path: /hot_cache/status
Parameters: None
Response: application/json
curl http://localhost:7000/hot_cache/status
Example Response:
{
"status": "success",
"hot_cache": true,
"message": "Hot cache is enabled"
}
Chunk Statistics#
These endpoints manage chunk-level statistics collection via
ChunkStatisticsLookupClient. They are only available when the
lookup client supports statistics.
Lookup Client/Server Management#
These endpoints allow runtime management of the lookup client and server. They are useful for dynamically reconfiguring the lookup mechanism without restarting the service.
Important
Configuration Update Required First
Before calling /lookup/create or /lookup/recreate, you MUST
update the configuration via the /conf API first. The new lookup
client/server will be created using LookupClientFactory.
For some configurations (e.g., switching enable_scheduler_bypass_lookup),
you only need to update the scheduler’s configuration and recreate its
lookup client. The workers don’t need changes in this case.
GET /lookup/info — Lookup Client/Server Information#
Get information about the current lookup client and server status.
Shows wrapper chain if applicable (e.g., HitLimitLookupClient(LMCacheLookupClient)).
Method: GET
Path: /lookup/info
Parameters: None
Response: application/json
curl http://localhost:6999/lookup/info
Example Response (scheduler):
{
"client": "HitLimitLookupClient(LMCacheBypassLookupClient)",
"server": "None",
"role": "scheduler"
}
Example Response (worker):
{
"client": "None",
"server": "LMCacheLookupServer",
"role": "worker"
}
POST /lookup/close — Close Lookup Client/Server#
Close the current lookup client (scheduler) or server (worker).
Method: POST
Path: /lookup/close
Parameters: None
Response: application/json
curl -X POST http://localhost:6999/lookup/close
Example Response:
{
"old": "HitLimitLookupClient(LMCacheBypassLookupClient)",
"role": "scheduler"
}
POST /lookup/create — Create Lookup Client/Server#
Create a new lookup client (scheduler) or server (worker) using current config.
Method: POST
Path: /lookup/create
Parameters:
- dryrun (bool): If true, only show what would be created
Response: application/json
# Dryrun - preview what would be created
curl -X POST "http://localhost:6999/lookup/create?dryrun=true"
# Actually create
curl -X POST http://localhost:6999/lookup/create
Example Response (dryrun):
{
"new": "LMCacheLookupClient",
"dryrun": true,
"role": "scheduler"
}
Example Response (actual create):
{
"new": "LMCacheLookupClient",
"role": "scheduler"
}
POST /lookup/recreate — Recreate Lookup Client/Server#
Recreate the lookup client or server (equivalent to close + create). The endpoint automatically determines which component based on role:
- scheduler role: recreates lookup client
- worker role: recreates lookup server
Method: POST
Path: /lookup/recreate
Parameters: None
Response: application/json
Usage Flow:
# Step 1: Update worker configuration (if needed)
curl -X POST "http://localhost:7000/conf" \
-H "Content-Type: application/json" \
-d '{"enable_async_loading": true}'
# Step 2: Recreate lookup server on worker
curl -X POST "http://localhost:7000/lookup/recreate"
# Step 3: Update scheduler configuration
curl -X POST "http://localhost:6999/conf" \
-H "Content-Type: application/json" \
-d '{"enable_scheduler_bypass_lookup": true}'
# Step 4: Recreate lookup client on scheduler
curl -X POST "http://localhost:6999/lookup/recreate"
Example Response (scheduler):
{
"old": "HitLimitLookupClient(LMCacheBypassLookupClient)",
"new": "LMCacheLookupClient",
"role": "scheduler"
}
Example Response (worker):
{
"old": "LMCacheLookupServer",
"new": "LMCacheAsyncLookupServer",
"role": "worker"
}
Note
Client-only Changes
For some configuration changes (e.g., switching enable_scheduler_bypass_lookup),
you only need to update the scheduler’s configuration and recreate its lookup
client. Worker-side lookup servers don’t need to be recreated in this case.
POST /chunk_statistics/start — Start Statistics#
Start collecting chunk statistics.
Method: POST
Path: /chunk_statistics/start
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/start
Example Response:
{
"status": "success",
"message": "Started"
}
Error Response (not supported, HTTP 400):
{
"error": "Not available",
"message": "Client does not support statistics."
}
POST /chunk_statistics/stop — Stop Statistics#
Stop collecting chunk statistics.
Method: POST
Path: /chunk_statistics/stop
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/stop
Example Response:
{
"status": "success",
"message": "Stopped"
}
POST /chunk_statistics/reset — Reset Statistics#
Reset all collected chunk statistics.
Method: POST
Path: /chunk_statistics/reset
Parameters: None
Response: application/json
curl -X POST http://localhost:7000/chunk_statistics/reset
Example Response:
{
"status": "success",
"message": "Reset"
}
GET /chunk_statistics/status — Statistics Status#
Get current chunk statistics and auto-exit configuration.
Method: GET
Path: /chunk_statistics/status
Parameters: None
Response: application/json
curl http://localhost:7000/chunk_statistics/status
Example Response:
{
"is_collecting": true,
"total_chunks": 1000,
"unique_chunks": 500,
"timestamp": 1706745600.0,
"auto_exit_enabled": true,
"auto_exit_timeout_hours": 24.0,
"auto_exit_target_unique_chunks": 1000
}
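The start/stop/reset/status lifecycle above can be sketched as a small counter object. This is a hypothetical illustration of the reported fields (`is_collecting`, `total_chunks`, `unique_chunks`, `timestamp`), not LMCache's `ChunkStatisticsLookupClient`; the auto-exit fields are omitted for brevity.

```python
import time

class ChunkStatistics:
    """Hypothetical sketch of the counters behind /chunk_statistics/*."""

    def __init__(self) -> None:
        self.is_collecting = False
        self.total_chunks = 0
        self._unique: set[str] = set()

    def start(self) -> None:          # POST /chunk_statistics/start
        self.is_collecting = True

    def stop(self) -> None:           # POST /chunk_statistics/stop
        self.is_collecting = False

    def reset(self) -> None:          # POST /chunk_statistics/reset
        self.total_chunks = 0
        self._unique.clear()

    def observe(self, chunk_hash: str) -> None:
        """Record one chunk lookup; repeated hashes only count once as unique."""
        if self.is_collecting:
            self.total_chunks += 1
            self._unique.add(chunk_hash)

    def status(self) -> dict:         # GET /chunk_statistics/status
        return {
            "is_collecting": self.is_collecting,
            "total_chunks": self.total_chunks,
            "unique_chunks": len(self._unique),
            "timestamp": time.time(),
        }
```

In the example response above, `total_chunks=1000` with `unique_chunks=500` would mean each chunk was looked up twice on average.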
Bypass Mode#
Bypass mode allows dynamically skipping specific storage backends at runtime.
Bypassed backends are excluded from contains/put/get operations.
This is useful for fault injection testing, isolating a problematic backend,
or debugging without restarting the engine.
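The bypass behavior reduces to filtering the backend list before contains/put/get, plus validating the backend name as in the error responses below. A hypothetical sketch (helper names are ours, not LMCache's):

```python
def active_backends(all_backends: list[str], bypassed: set[str]) -> list[str]:
    """Backends that contains/put/get will actually consult."""
    return [name for name in all_backends if name not in bypassed]

def add_bypass(all_backends: list[str], bypassed: set[str], name: str) -> dict:
    """Mimic PUT /bypass/add: validate the name, then add it to the bypass set."""
    if name not in all_backends:
        return {
            "error": "Unknown backend",
            "message": f"Backend '{name}' not found. Available: {all_backends}",
        }
    already = name in bypassed
    bypassed.add(name)
    return {
        "status": "success",
        "backend_name": name,
        "bypassed": True,
        "was_already_bypassed": already,
        "bypassed_backends": sorted(bypassed),
    }
```

For fault-injection testing, bypassing e.g. RemoteBackend makes every lookup behave as if the remote store were empty, without restarting the engine.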
GET /bypass/list — List Bypassed Backends#
List all currently bypassed backends and all available backend names.
Method: GET
Path: /bypass/list
Parameters: None
Response: application/json
curl http://localhost:7000/bypass/list
Example Response:
{
"status": "success",
"bypassed_backends": ["RemoteBackend"],
"all_backends": ["LocalCPUBackend", "RemoteBackend"]
}
PUT /bypass/add — Add a Backend to Bypass List#
Add a backend to the bypass list. The bypassed backend will be excluded
from contains/put/get operations.
Method: PUT
Path: /bypass/add
Parameters:
- backend_name (str): Name of the backend to bypass (required)
Response: application/json
curl -X PUT "http://localhost:7000/bypass/add?backend_name=RemoteBackend"
Example Response:
{
"status": "success",
"backend_name": "RemoteBackend",
"bypassed": true,
"was_already_bypassed": false,
"bypassed_backends": ["RemoteBackend"]
}
Error Response (unknown backend, HTTP 400):
{
"error": "Unknown backend",
"message": "Backend 'FooBackend' not found. Available: ['LocalCPUBackend', 'RemoteBackend']"
}
PUT /bypass/remove — Remove a Backend from Bypass List#
Remove a backend from the bypass list, restoring it to normal operation.
Method: PUT
Path: /bypass/remove
Parameters:
- backend_name (str): Name of the backend to restore (required)
Response: application/json
curl -X PUT "http://localhost:7000/bypass/remove?backend_name=RemoteBackend"
Example Response:
{
"status": "success",
"backend_name": "RemoteBackend",
"bypassed": false,
"was_bypassed": true,
"bypassed_backends": []
}
Error Response (unknown backend, HTTP 400):
{
"error": "Unknown backend",
"message": "Backend 'FooBackend' not found. Available: ['LocalCPUBackend', 'RemoteBackend']"
}