vLLM / Inference APIs

These APIs are specific to vLLM inference workers and provide cache management, configuration, freeze control, chunk statistics, and version information.

Version & Info

`GET /lmc_version` — LMCache Version

Get the LMCache library version string.

Method: GET
Path: /lmc_version
Parameters: None
Response: Plain version string.

curl http://localhost:7000/lmc_version

`GET /commit_id` — LMCache Commit ID

Get the LMCache git commit ID.

Method: GET
Path: /commit_id
Parameters: None
Response: Plain commit ID string.

curl http://localhost:7000/commit_id

`GET /version` — Full Version Info

Get full version info (version + commit ID).

Method: GET
Path: /version
Parameters: None
Response: Combined version string.

curl http://localhost:7000/version

`GET /inference_info` — Inference Information

Get inference information including vLLM config and LMCache details.

Method: GET
Path: /inference_info
Parameters:

Name	Type	Description
`format`	str	(Optional) Reserved for future use

Response: application/json

curl http://localhost:7000/inference_info

Error Response (HTTP 500):

{
  "error": "Failed to get inference info",
  "message": "..."
}

`GET /inference_version` — vLLM Version

Get the vLLM version information.

Method: GET
Path: /inference_version
Parameters: None
Response: application/json

curl http://localhost:7000/inference_version

Example Response:

{
  "vllm_version": "0.8.0"
}
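
A client that needs to gate behavior on the vLLM version can parse the `vllm_version` field into a comparable tuple. The following is only a sketch: `parse_version` is a hypothetical helper, not part of LMCache or vLLM, and it assumes the version string begins with dot-separated integers, as in the example response.

```python
import json
import re


def parse_version(version: str) -> tuple:
    """Return the leading dot-separated integers of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version prefix in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


# Body of the example response shown above.
body = json.loads('{"vllm_version": "0.8.0"}')
version = parse_version(body["vllm_version"])
print(version >= (0, 8))
```

Any suffix after the numeric prefix (for example a release-candidate tag) is ignored by this helper, which may or may not be what a given client wants.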

Configuration & Metadata

`GET /conf` — Get Configuration

Get current LMCache engine configuration values.

Method: GET
Path: /conf
Parameters:

Name	Type	Description
`names`	str	(Optional) Comma-separated list of config names to filter

Response: application/json — JSON object of configuration key-value pairs.

# Get all config
curl http://localhost:7000/conf

# Get specific config values
curl "http://localhost:7000/conf?names=min_retrieve_tokens,save_decode_cache"
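
The `names` filter is a single comma-separated query value. A minimal sketch of building such a URL from Python is shown here; `filtered_url` is a hypothetical helper name, and the base URL is the one used in the curl examples.

```python
from typing import Sequence
from urllib.parse import urlencode


def filtered_url(base_url: str, path: str, names: Sequence[str] = ()) -> str:
    """Build a URL for an endpoint that accepts a comma-separated `names` filter."""
    url = base_url.rstrip("/") + path
    if names:
        # safe="," keeps the separator readable instead of encoding it as %2C.
        url += "?" + urlencode({"names": ",".join(names)}, safe=",")
    return url


print(filtered_url("http://localhost:7000", "/conf"))
print(filtered_url("http://localhost:7000", "/conf", ["min_retrieve_tokens", "save_decode_cache"]))
```

The same helper applies to any other endpoint that documents a `names` parameter with the same format.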

`POST /conf` — Update Configuration (Experimental)

Update one or more configuration values at runtime.

Warning

This feature is currently experimental. All configuration keys are mutable at runtime by default unless explicitly marked as "mutable": False in _CONFIG_DEFINITIONS. The default will be changed to immutable once the feature is stabilized.

Updating a configuration only modifies the value in the LMCacheEngineConfig object. If a component has already cached the value elsewhere, the change will not take effect for that component.

Method: POST
Path: /conf
Content-Type: application/json
Request Body: JSON object with config name-value pairs.
Response: application/json

curl -X POST http://localhost:7000/conf \
  -H "Content-Type: application/json" \
  -d '{"min_retrieve_tokens": 512, "save_decode_cache": true}'

Example Response (HTTP 200):

{
  "updated": {
    "min_retrieve_tokens": 512,
    "save_decode_cache": true
  }
}

Example Response (partial failure, HTTP 400):

{
  "updated": {"min_retrieve_tokens": 512},
  "errors": {"unknown_key": "Unknown config"}
}

Error Cases:

Unknown config key → "Unknown config"
Immutable config key → "Config is not mutable at runtime"
Invalid JSON body → HTTP 400
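
Because a partial failure returns HTTP 400 while still reporting the keys that were updated, a client should read the response body even on an error status. The sketch below prepares the request with the standard library and splits a response body into its two parts. The helper names are hypothetical, and nothing is sent over the network.

```python
import json
from urllib.request import Request


def build_conf_update(base_url: str, changes: dict) -> Request:
    """Prepare (but do not send) a POST /conf request."""
    return Request(
        base_url.rstrip("/") + "/conf",
        data=json.dumps(changes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_conf_result(body: dict) -> tuple:
    """Return (updated, errors); either part may be absent from the response."""
    return body.get("updated", {}), body.get("errors", {})


request = build_conf_update("http://localhost:7000", {"min_retrieve_tokens": 512})
# Partial-failure body from the example above (returned with HTTP 400).
updated, errors = split_conf_result(
    {"updated": {"min_retrieve_tokens": 512}, "errors": {"unknown_key": "Unknown config"}}
)
print(sorted(updated), sorted(errors))
```

When the request is sent with `urllib.request.urlopen`, an HTTP 400 raises `urllib.error.HTTPError`; the JSON body is still available through the exception's `read()` method.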

`GET /meta` — Engine Metadata

Get metadata of the LMCache engine (e.g., worker_id, model_name, kv_shape).

Method: GET
Path: /meta
Parameters:

Name	Type	Description
`names`	str	(Optional) Comma-separated list of attribute names to filter

Response: application/json — JSON object of metadata attributes.

# Get all metadata
curl http://localhost:7000/meta

# Get specific attributes
curl "http://localhost:7000/meta?names=worker_id,model_name"

Cache Operations

`DELETE /cache/clear` — Clear Cache

Clear cached KV data from the LMCache engine.

Method: DELETE
Path: /cache/clear
Parameters:

Name	Type	Description
`locations`	list[str]	(Optional) Storage backends to clear (e.g. LocalCPUBackend, LocalDiskBackend). If not specified, clears all.

Response: application/json

# Clear all cache
curl -X DELETE http://localhost:7000/cache/clear

# Clear specific backends
curl -X DELETE "http://localhost:7000/cache/clear?locations=LocalCPUBackend&locations=LocalDiskBackend"

Example Response:

{
  "status": "success",
  "num_removed": 10
}
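
Unlike the comma-separated `names` filters, `locations` is passed as a repeated query parameter, as the second curl command shows. A small sketch of producing that form with the standard library follows; `clear_cache_url` is a hypothetical helper name.

```python
from typing import Sequence
from urllib.parse import urlencode


def clear_cache_url(base_url: str, locations: Sequence[str] = ()) -> str:
    """Build the DELETE /cache/clear URL, repeating `locations` once per backend."""
    url = base_url.rstrip("/") + "/cache/clear"
    if locations:
        # doseq=True expands the list into locations=A&locations=B.
        url += "?" + urlencode({"locations": list(locations)}, doseq=True)
    return url


print(clear_cache_url("http://localhost:7000"))
print(clear_cache_url("http://localhost:7000", ["LocalCPUBackend", "LocalDiskBackend"]))
```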

`POST /cache/store` — Store KV Cache

Store KV cache data into the LMCache engine using mock tokens.

Method: POST
Path: /cache/store
Parameters:

Name	Type	Description
`tokens_mock`	str	Two comma-separated numbers: "start,end" (e.g. "0,100" generates tokens [0..99])

Response: application/json

curl -X POST "http://localhost:7000/cache/store?tokens_mock=0,100"

Example Response:

{
  "status": "success",
  "num_tokens": 100
}

Error Response (missing params, HTTP 400):

{
  "error": "Missing parameters",
  "message": "Must specify either tokens_input or tokens_mock"
}
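
To make the `tokens_mock` semantics concrete, the sketch below expands a "start,end" string into the half-open token range described in the parameter table. It illustrates the documented behavior only; `mock_tokens` is a hypothetical name and is not the server's implementation.

```python
def mock_tokens(tokens_mock: str) -> list:
    """Expand a "start,end" string into the token IDs start..end-1."""
    parts = tokens_mock.split(",")
    if len(parts) != 2:
        raise ValueError('expected two comma-separated numbers, e.g. "0,100"')
    start, end = (int(part) for part in parts)
    return list(range(start, end))


tokens = mock_tokens("0,100")
print(len(tokens), tokens[0], tokens[-1])
```

The length of the expanded range matches the `num_tokens` value in the example response.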

`POST /cache/retrieve` — Retrieve KV Cache

Retrieve KV cache data from the LMCache engine using mock tokens.

Method: POST
Path: /cache/retrieve
Parameters:

Name	Type	Description
`tokens_mock`	str	Two comma-separated numbers: "start,end" (e.g. "0,100")

Response: application/json

curl -X POST "http://localhost:7000/cache/retrieve?tokens_mock=0,100"

Example Response:

{
  "status": "success",
  "num_tokens": 100,
  "num_retrieved": 80
}
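
The example response reports fewer retrieved tokens than requested. A client that wants a single hit-ratio figure could derive it as sketched here; `retrieve_hit_ratio` is a hypothetical helper that uses only the two fields shown in the example.

```python
def retrieve_hit_ratio(response: dict) -> float:
    """Fraction of requested tokens that were retrieved (0.0 when none were requested)."""
    requested = response["num_tokens"]
    if requested == 0:
        return 0.0
    return response["num_retrieved"] / requested


example = {"status": "success", "num_tokens": 100, "num_retrieved": 80}
print(retrieve_hit_ratio(example))
```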

`GET /cache/kvcache/check` — KVCache Checksum

Compute MD5 checksums for the kvcaches at the specified slot_mapping positions. Use this endpoint to verify that stored and retrieved kvcaches are identical.

Method: GET
Path: /cache/kvcache/check

Parameters:

Name	Type	Description
`slot_mapping`	str	Slot indices, comma-separated. Supports ranges: `"0,1,2,3"` or `"1,2,3,[9,12],17,19"`
`chunk_size`	int	Chunk size for computing per-chunk checksums (required)
`layerwise`	bool	If `true`, output per-layer checksums per chunk (default: `false`)

Response: application/json

# Per-chunk checksum (all layers combined)
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2"

# Per-layer per-chunk checksum
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2&layerwise=true"

Example Response (layerwise=false):

{
  "status": "success",
  "slot_mapping_ranges": [[0, 3]],
  "chunk_size": 2,
  "num_chunks": 2,
  "chunk_checksums": ["abc123...", "def456..."],
  "layerwise": false
}

Example Response (layerwise=true):

{
  "status": "success",
  "slot_mapping_ranges": [[0, 3]],
  "chunk_size": 2,
  "num_chunks": 2,
  "chunk_checksums": {
    "layer_0": ["abc123...", "def456..."],
    "layer_1": ["ghi789...", "jkl012..."]
  },
  "layerwise": true
}
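
Two client-side steps are common with this endpoint: working out which slots a `slot_mapping` string covers, and comparing the checksums from two calls (for example, before a store and after a retrieve). The sketch below covers both. The helper names are hypothetical, and bracketed ranges are assumed to be inclusive, which is an inference from the example response reporting `[[0, 3]]` for slots 0 to 3.

```python
import re


def expand_slot_mapping(slot_mapping: str) -> list:
    """Expand "1,2,3,[9,12],17" into slot indices (ranges assumed inclusive)."""
    slots = []
    # Match either a bracketed range "[a,b]" or a single index.
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*\]|(\d+)"
    for match in re.finditer(pattern, slot_mapping):
        low, high, single = match.groups()
        if single is not None:
            slots.append(int(single))
        else:
            slots.extend(range(int(low), int(high) + 1))
    return slots


def mismatched_chunks(before: dict, after: dict) -> list:
    """List chunk positions whose checksums differ between two responses.

    Both responses are expected to come from the same query parameters.
    """
    a, b = before["chunk_checksums"], after["chunk_checksums"]
    if isinstance(a, dict):  # layerwise=true: {layer_name: [checksum, ...]}
        return [
            (layer, index)
            for layer, checksums in a.items()
            for index, checksum in enumerate(checksums)
            if b[layer][index] != checksum
        ]
    return [index for index, (x, y) in enumerate(zip(a, b)) if x != y]


print(expand_slot_mapping("1,2,3,[9,12],17,19"))
print(mismatched_chunks({"chunk_checksums": ["a", "b"]}, {"chunk_checksums": ["a", "c"]}))
```

An empty list from `mismatched_chunks` means every compared chunk produced the same checksum in both responses.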

`POST /cache/kvcache/record_slot` — Toggle Slot Logging

Enable or disable KVCache slot_mapping logging during store/retrieve operations.

Method: POST
Path: /cache/kvcache/record_slot
Parameters:

Name	Type	Description
`enabled`	str	"true" to enable, "false" to disable. Omit to query current status.

Response: application/json

# Enable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=true"

# Disable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=false"

# Check current status
curl -X POST http://localhost:7000/cache/kvcache/record_slot

Example Response:

{
  "status": "success",
  "kvcache_check_log_enabled": true
}

`GET /cache/kvcache/info` — KVCache Information

Get information about the structure of the current kvcaches, including layer names, shapes, and device info.

Method: GET
Path: /cache/kvcache/info
Parameters: None
Response: application/json

curl http://localhost:7000/cache/kvcache/info

Example Response:

{
  "status": "success",
  "num_layers": 32,
  "layers": {
    "layer_0": {
      "shape": [2, 128, 16, 64, 128],
      "dtype": "torch.bfloat16",
      "device": "cuda:0"
    }
  }
}
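
The reported shape and dtype are enough to estimate how much memory a layer's KV tensor occupies. The sketch below does that arithmetic for the example response. `layer_bytes` and the `DTYPE_BYTES` table are hypothetical and cover only a few common dtypes.

```python
import math

# Bytes per element for a few common dtypes (extend as needed).
DTYPE_BYTES = {"torch.float32": 4, "torch.float16": 2, "torch.bfloat16": 2}


def layer_bytes(layer: dict) -> int:
    """Size in bytes of one layer's KV tensor, from its reported shape and dtype."""
    return math.prod(layer["shape"]) * DTYPE_BYTES[layer["dtype"]]


info = {
    "layers": {
        "layer_0": {
            "shape": [2, 128, 16, 64, 128],
            "dtype": "torch.bfloat16",
            "device": "cuda:0",
        }
    }
}
size = layer_bytes(info["layers"]["layer_0"])
print(size / 2**20, "MiB")
```

For the example layer this gives 33,554,432 elements at 2 bytes each, or 64 MiB.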

`POST /cache/load-fs-chunks` — Load FS Chunks

Load chunk files from FSConnector storage into LocalCPUBackend’s hot cache.

Method: POST
Path: /cache/load-fs-chunks
Content-Type: application/json

Request Body:

Field	Type	Description
`config_path`	str	Path to LMCache engine configuration YAML file (required)
`max_chunks`	int	(Optional) Maximum number of chunks to load
`max_failed_keys`	int	Maximum failed keys to report (default: 10)

Response: application/json
Tags: cache-management

curl -X POST http://localhost:7000/cache/load-fs-chunks \
  -H "Content-Type: application/json" \
  -d '{"config_path": "/path/to/lmcache.yaml", "max_chunks": 100}'

Example Response (HTTP 200):

{
  "status": "success",
  "loaded_chunks": 95,
  "total_files": 100,
  "failed_keys": ["key1", "key2"],
  "config_path": "/path/to/lmcache.yaml"
}

Error Response (invalid config, HTTP 400):

{
  "error": "Failed to load chunks from FSConnector",
  "message": "Configuration file not found",
  "config_path": "/path/to/lmcache.yaml"
}
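
Since only `config_path` is required, a client can leave the optional fields out of the body entirely rather than sending nulls. A minimal sketch of such a body builder follows; `load_fs_chunks_body` is a hypothetical helper, and whether the server accepts explicit nulls is not stated in this reference.

```python
import json
from typing import Optional


def load_fs_chunks_body(
    config_path: str,
    max_chunks: Optional[int] = None,
    max_failed_keys: Optional[int] = None,
) -> bytes:
    """Serialize the request body, leaving out optional fields that were not set."""
    if not config_path:
        raise ValueError("config_path is required")
    body = {"config_path": config_path}
    if max_chunks is not None:
        body["max_chunks"] = max_chunks
    if max_failed_keys is not None:
        body["max_failed_keys"] = max_failed_keys
    return json.dumps(body).encode("utf-8")


print(load_fs_chunks_body("/path/to/lmcache.yaml", max_chunks=100).decode("utf-8"))
```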

Freeze Mode

`PUT /freeze/enable` — Enable Freeze Mode

Enable freeze mode for the LMCache engine. When enabled:

All store operations will be skipped (no new data stored)
Only the local_cpu backend will be used for retrieval
No admit/evict messages will be generated

This protects the local_cpu hot cache from changes.

Method: PUT
Path: /freeze/enable
Parameters: None
Response: application/json

curl -X PUT http://localhost:7000/freeze/enable

Example Response:

{
  "status": "success",
  "freeze": true,
  "message": "Freeze mode enabled successfully"
}

`PUT /freeze/disable` — Disable Freeze Mode

Disable freeze mode. Store operations will proceed normally.

Method: PUT
Path: /freeze/disable
Parameters: None
Response: application/json

curl -X PUT http://localhost:7000/freeze/disable

Example Response:

{
  "status": "success",
  "freeze": false,
  "message": "Freeze mode disabled successfully"
}

`GET /freeze/status` — Freeze Status

Get the current freeze mode status.

Method: GET
Path: /freeze/status
Parameters: None
Response: application/json

curl http://localhost:7000/freeze/status

Example Response:

{
  "status": "success",
  "freeze": true,
  "message": "Freeze mode is enabled"
}
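
When freeze mode is needed only for the duration of a task, pairing the enable and disable calls in a context manager avoids leaving the engine frozen if the task fails. The sketch below assumes a caller-supplied `send(method, path)` function that performs the HTTP request. Both `frozen` and `send` are hypothetical names, and a recording stand-in is used here so the example runs without a server.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def frozen(send: Callable[[str, str], dict]) -> Iterator[None]:
    """Enable freeze mode for the duration of a `with` block, then disable it."""
    send("PUT", "/freeze/enable")
    try:
        yield
    finally:
        # Runs even if the block raises, so the cache is not left frozen.
        send("PUT", "/freeze/disable")


# Stand-in transport that records calls instead of contacting a server.
calls = []


def fake_send(method: str, path: str) -> dict:
    calls.append((method, path))
    return {"status": "success"}


with frozen(fake_send):
    pass  # run read-only benchmarks or diagnostics here

print(calls)
```

This sketch always disables freeze mode on exit. A more careful version might query `/freeze/status` first and restore whatever state it found.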

Hot Cache

These endpoints control the hot cache feature of LocalCPUBackend. When hot cache is enabled, frequently accessed KV cache data will be kept in CPU memory for faster retrieval.

`PUT /hot_cache/enable` — Enable Hot Cache

Enable hot cache for the LocalCPUBackend.

Method: PUT
Path: /hot_cache/enable
Parameters: None
Response: application/json

curl -X PUT http://localhost:7000/hot_cache/enable

Example Response:

{
  "status": "success",
  "hot_cache": true,
  "message": "Hot cache enabled successfully"
}

`PUT /hot_cache/disable` — Disable Hot Cache

Disable hot cache for the LocalCPUBackend. Existing hot cache entries will be cleared and no new data will be written.

Method: PUT
Path: /hot_cache/disable
Parameters: None
Response: application/json

curl -X PUT http://localhost:7000/hot_cache/disable

Example Response:

{
  "status": "success",
  "hot_cache": false,
  "message": "Hot cache disabled successfully"
}

`GET /hot_cache/status` — Hot Cache Status

Get the current hot cache status of LocalCPUBackend.

Method: GET
Path: /hot_cache/status
Parameters: None
Response: application/json

curl http://localhost:7000/hot_cache/status

Example Response:

{
  "status": "success",
  "hot_cache": true,
  "message": "Hot cache is enabled"
}

Chunk Statistics

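These endpoints manage chunk-level statistics collection via ChunkStatisticsLookupClient. They are only available when the lookup client supports statistics. The individual `/chunk_statistics/*` endpoints are documented after the Lookup Client/Server Management section.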

Lookup Client/Server Management

These endpoints allow runtime management of the lookup client and server. They are useful for dynamically reconfiguring the lookup mechanism without restarting the service.

Important

Configuration Update Required First

Before calling /lookup/create or /lookup/recreate, you MUST update the configuration via the /conf API first. The new lookup client/server will be created using LookupClientFactory.

For some configurations (e.g., switching enable_scheduler_bypass_lookup), you only need to update the scheduler’s configuration and recreate its lookup client. The workers don’t need changes in this case.

`GET /lookup/info` — Lookup Client/Server Information

Get information about the current lookup client and server status. The response shows the wrapper chain where one exists (e.g., HitLimitLookupClient(LMCacheLookupClient)).

Method: GET
Path: /lookup/info
Parameters: None
Response: application/json

curl http://localhost:6999/lookup/info

Example Response (scheduler):

{
  "client": "HitLimitLookupClient(LMCacheBypassLookupClient)",
  "server": "None",
  "role": "scheduler"
}

Example Response (worker):

{
  "client": "None",
  "server": "LMCacheLookupServer",
  "role": "worker"
}

`POST /lookup/close` — Close Lookup Client/Server

Close the current lookup client (scheduler) or server (worker).

Method: POST
Path: /lookup/close
Parameters: None
Response: application/json

curl -X POST http://localhost:6999/lookup/close

Example Response:

{
  "old": "HitLimitLookupClient(LMCacheBypassLookupClient)",
  "role": "scheduler"
}

`POST /lookup/create` — Create Lookup Client/Server

Create a new lookup client (scheduler) or server (worker) using the current configuration.

Method: POST
Path: /lookup/create
Parameters:

Name	Type	Description
`dryrun`	bool	If true, only show what would be created

Response: application/json

# Dryrun - preview what would be created
curl -X POST "http://localhost:6999/lookup/create?dryrun=true"

# Actually create
curl -X POST http://localhost:6999/lookup/create

Example Response (dryrun):

{
  "new": "LMCacheLookupClient",
  "dryrun": true,
  "role": "scheduler"
}

Example Response (actual create):

{
  "new": "LMCacheLookupClient",
  "role": "scheduler"
}

`POST /lookup/recreate` — Recreate Lookup Client/Server

Recreate the lookup client or server (equivalent to close + create). The endpoint automatically determines which component to recreate based on the role:

scheduler role: recreates lookup client
worker role: recreates lookup server

Method: POST
Path: /lookup/recreate
Parameters: None
Response: application/json

Usage Flow:

# Step 1: Update worker configuration (if needed)
curl -X POST "http://localhost:7000/conf" \
  -H "Content-Type: application/json" \
  -d '{"enable_async_loading": true}'

# Step 2: Recreate lookup server on worker
curl -X POST "http://localhost:7000/lookup/recreate"

# Step 3: Update scheduler configuration
curl -X POST "http://localhost:6999/conf" \
  -H "Content-Type: application/json" \
  -d '{"enable_scheduler_bypass_lookup": true}'

# Step 4: Recreate lookup client on scheduler
curl -X POST "http://localhost:6999/lookup/recreate"

Example Response (scheduler):

{
  "old": "HitLimitLookupClient(LMCacheBypassLookupClient)",
  "new": "LMCacheLookupClient",
  "role": "scheduler"
}

Example Response (worker):

{
  "old": "LMCacheLookupServer",
  "new": "LMCacheAsyncLookupServer",
  "role": "worker"
}

Note

Client-only Changes

For some configuration changes (e.g., switching enable_scheduler_bypass_lookup), you only need to update the scheduler’s configuration and recreate its lookup client. Worker-side lookup servers don’t need to be recreated in this case.
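
The ordering rule (update `/conf` first, then recreate) and the client-only shortcut can be expressed as a small planning function. The sketch below only builds the list of requests and does not send them. `recreate_plan` is a hypothetical helper, and the worker-before-scheduler order simply mirrors the usage flow above.

```python
def recreate_plan(worker_url, scheduler_url, worker_conf=None, scheduler_conf=None):
    """Return the ordered (method, url, body) steps for reconfiguring lookup."""
    steps = []
    for base_url, conf in ((worker_url, worker_conf), (scheduler_url, scheduler_conf)):
        if conf:
            # Configuration must be updated before the component is recreated.
            steps.append(("POST", base_url + "/conf", conf))
            steps.append(("POST", base_url + "/lookup/recreate", None))
    return steps


# Client-only change: the worker is left untouched.
plan = recreate_plan(
    "http://localhost:7000",
    "http://localhost:6999",
    scheduler_conf={"enable_scheduler_bypass_lookup": True},
)
for step in plan:
    print(step)
```

Passing both `worker_conf` and `scheduler_conf` yields the four steps of the usage flow in the same order.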

`POST /chunk_statistics/start` — Start Statistics

Start collecting chunk statistics.

Method: POST
Path: /chunk_statistics/start
Parameters: None
Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/start

Example Response:

{
  "status": "success",
  "message": "Started"
}

Error Response (not supported, HTTP 400):

{
  "error": "Not available",
  "message": "Client does not support statistics."
}

`POST /chunk_statistics/stop` — Stop Statistics

Stop collecting chunk statistics.

Method: POST
Path: /chunk_statistics/stop
Parameters: None
Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/stop

Example Response:

{
  "status": "success",
  "message": "Stopped"
}

`POST /chunk_statistics/reset` — Reset Statistics

Reset all collected chunk statistics.

Method: POST
Path: /chunk_statistics/reset
Parameters: None
Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/reset

Example Response:

{
  "status": "success",
  "message": "Reset"
}

`GET /chunk_statistics/status` — Statistics Status

Get current chunk statistics and auto-exit configuration.

Method: GET
Path: /chunk_statistics/status
Parameters: None
Response: application/json

curl http://localhost:7000/chunk_statistics/status

Example Response:

{
  "is_collecting": true,
  "total_chunks": 1000,
  "unique_chunks": 500,
  "timestamp": 1706745600.0,
  "auto_exit_enabled": true,
  "auto_exit_timeout_hours": 24.0,
  "auto_exit_target_unique_chunks": 1000
}
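
One figure a client might derive from this response is how often chunks repeat. The sketch below assumes that `total_chunks` counts every observed chunk and `unique_chunks` counts distinct ones. That reading is an interpretation of the field names rather than something this reference states, and `chunk_reuse_ratio` is a hypothetical helper.

```python
def chunk_reuse_ratio(status: dict) -> float:
    """Share of observed chunks that repeat an already-seen chunk."""
    total = status["total_chunks"]
    if total == 0:
        return 0.0
    return 1.0 - status["unique_chunks"] / total


example = {"is_collecting": True, "total_chunks": 1000, "unique_chunks": 500}
print(chunk_reuse_ratio(example))
```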

Bypass Mode

Bypass mode allows dynamically skipping specific storage backends at runtime. Bypassed backends are excluded from contains/put/get operations. This is useful for fault injection testing, isolating a problematic backend, or debugging without restarting the engine.

`GET /bypass/list` — List Bypassed Backends

List all currently bypassed backends and all available backend names.

Method: GET
Path: /bypass/list
Parameters: None
Response: application/json

curl http://localhost:7000/bypass/list

Example Response:

{
  "status": "success",
  "bypassed_backends": ["RemoteBackend"],
  "all_backends": ["LocalCPUBackend", "RemoteBackend"]
}
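
The list response carries enough information to work out which backends are still active and to validate a backend name on the client before requesting a bypass change. A minimal sketch with hypothetical helper names:

```python
def active_backends(listing: dict) -> list:
    """Backends from `all_backends` that are not currently bypassed."""
    bypassed = set(listing["bypassed_backends"])
    return [name for name in listing["all_backends"] if name not in bypassed]


def check_backend_name(name: str, listing: dict) -> None:
    """Raise early for a name that does not appear in `all_backends`."""
    if name not in listing["all_backends"]:
        raise ValueError(
            f"Backend {name!r} not found. Available: {listing['all_backends']}"
        )


listing = {
    "status": "success",
    "bypassed_backends": ["RemoteBackend"],
    "all_backends": ["LocalCPUBackend", "RemoteBackend"],
}
print(active_backends(listing))
check_backend_name("RemoteBackend", listing)  # passes silently
```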

`PUT /bypass/add` — Add a Backend to Bypass List

Add a backend to the bypass list. The bypassed backend will be excluded from contains/put/get operations.

Method: PUT
Path: /bypass/add
Parameters:

Name	Type	Description
`backend_name`	str	Name of the backend to bypass (required)

Response: application/json

curl -X PUT "http://localhost:7000/bypass/add?backend_name=RemoteBackend"

Example Response:

{
  "status": "success",
  "backend_name": "RemoteBackend",
  "bypassed": true,
  "was_already_bypassed": false,
  "bypassed_backends": ["RemoteBackend"]
}

Error Response (unknown backend, HTTP 400):

{
  "error": "Unknown backend",
  "message": "Backend 'FooBackend' not found. Available: ['LocalCPUBackend', 'RemoteBackend']"
}

`PUT /bypass/remove` — Remove a Backend from Bypass List

Remove a backend from the bypass list, restoring it to normal operation.

Method: PUT
Path: /bypass/remove
Parameters:

Name	Type	Description
`backend_name`	str	Name of the backend to restore (required)

Response: application/json

curl -X PUT "http://localhost:7000/bypass/remove?backend_name=RemoteBackend"

Example Response:

{
  "status": "success",
  "backend_name": "RemoteBackend",
  "bypassed": false,
  "was_bypassed": true,
  "bypassed_backends": []
}

Error Response (unknown backend, HTTP 400):

{
  "error": "Unknown backend",
  "message": "Backend 'FooBackend' not found. Available: ['LocalCPUBackend', 'RemoteBackend']"
}