vLLM / Inference APIs#

These APIs are specific to vLLM inference workers and provide cache management, configuration, freeze control, chunk statistics, and version information.

Version & Info#

GET /lmc_version — LMCache Version#

Get the LMCache library version string.

  • Method: GET

  • Path: /lmc_version

  • Parameters: None

  • Response: Plain version string.

curl http://localhost:7000/lmc_version

GET /commit_id — LMCache Commit ID#

Get the LMCache git commit ID.

  • Method: GET

  • Path: /commit_id

  • Parameters: None

  • Response: Plain commit ID string.

curl http://localhost:7000/commit_id

GET /version — Full Version Info#

Get full version info (version + commit ID).

  • Method: GET

  • Path: /version

  • Parameters: None

  • Response: Combined version string.

curl http://localhost:7000/version

GET /inference_info — Inference Information#

Get inference information including vLLM config and LMCache details.

  • Method: GET

  • Path: /inference_info

  • Parameters:

    Name

    Type

    Description

    format

    str

    (Optional) Reserved for future use

  • Response: application/json

curl http://localhost:7000/inference_info

Error Response (HTTP 500):

{
  "error": "Failed to get inference info",
  "message": "..."
}

GET /inference_version — vLLM Version#

Get the vLLM version information.

  • Method: GET

  • Path: /inference_version

  • Parameters: None

  • Response: application/json

curl http://localhost:7000/inference_version

Example Response:

{
  "vllm_version": "0.8.0"
}

Configuration & Metadata#

GET /conf — Get Configuration#

Get current LMCache engine configuration values.

  • Method: GET

  • Path: /conf

  • Parameters:

    Name

    Type

    Description

    names

    str

    (Optional) Comma-separated list of config names to filter

  • Response: application/json — JSON object of configuration key-value pairs.

# Get all config
curl http://localhost:7000/conf

# Get specific config values
curl "http://localhost:7000/conf?names=min_retrieve_tokens,save_decode_cache"

POST /conf — Update Configuration (Experimental)#

Update one or more configuration values at runtime.

Warning

This feature is currently experimental. All configuration keys are mutable at runtime by default unless explicitly marked as "mutable": False in _CONFIG_DEFINITIONS. The default will be changed to immutable once the feature is stabilized.

Updating a configuration only modifies the value in the LMCacheEngineConfig object. If a component has already cached the value elsewhere, the change will not take effect for that component.

  • Method: POST

  • Path: /conf

  • Content-Type: application/json

  • Request Body: JSON object with config name-value pairs.

  • Response: application/json

curl -X POST http://localhost:7000/conf \
  -H "Content-Type: application/json" \
  -d '{"min_retrieve_tokens": 512, "save_decode_cache": true}'

Example Response (HTTP 200):

{
  "updated": {
    "min_retrieve_tokens": 512,
    "save_decode_cache": true
  }
}

Example Response (partial failure, HTTP 400):

{
  "updated": {"min_retrieve_tokens": 512},
  "errors": {"unknown_key": "Unknown config"}
}

Error Cases:

  • Unknown config key → "Unknown config"

  • Immutable config key → "Config is not mutable at runtime"

  • Invalid JSON body → HTTP 400

GET /meta — Engine Metadata#

Get metadata of the LMCache engine (e.g., worker_id, model_name, kv_shape).

  • Method: GET

  • Path: /meta

  • Parameters:

    Name

    Type

    Description

    names

    str

    (Optional) Comma-separated list of attribute names to filter

  • Response: application/json — JSON object of metadata attributes.

# Get all metadata
curl http://localhost:7000/meta

# Get specific attributes
curl "http://localhost:7000/meta?names=worker_id,model_name"

Cache Operations#

DELETE /cache/clear — Clear Cache#

Clear cached KV data from the LMCache engine.

  • Method: DELETE

  • Path: /cache/clear

  • Parameters:

    Name

    Type

    Description

    locations

    list[str]

    (Optional) Storage backends to clear (e.g. LocalCPUBackend, LocalDiskBackend). If not specified, clears all.

  • Response: application/json

# Clear all cache
curl -X DELETE http://localhost:7000/cache/clear

# Clear specific backends
curl -X DELETE "http://localhost:7000/cache/clear?locations=LocalCPUBackend&locations=LocalDiskBackend"

Example Response:

{
  "status": "success",
  "num_removed": 10
}

POST /cache/store — Store KV Cache#

Store KV cache data into the LMCache engine using mock tokens.

  • Method: POST

  • Path: /cache/store

  • Parameters:

    Name

    Type

    Description

    tokens_mock

    str

    Two comma-separated numbers: "start,end" (e.g. "0,100" generates tokens [0..99])

  • Response: application/json

curl -X POST "http://localhost:7000/cache/store?tokens_mock=0,100"

Example Response:

{
  "status": "success",
  "num_tokens": 100
}

Error Response (missing params, HTTP 400):

{
  "error": "Missing parameters",
  "message": "Must specify either tokens_input or tokens_mock"
}

POST /cache/retrieve — Retrieve KV Cache#

Retrieve KV cache data from the LMCache engine using mock tokens.

  • Method: POST

  • Path: /cache/retrieve

  • Parameters:

    Name

    Type

    Description

    tokens_mock

    str

    Two comma-separated numbers: "start,end" (e.g. "0,100")

  • Response: application/json

curl -X POST "http://localhost:7000/cache/retrieve?tokens_mock=0,100"

Example Response:

{
  "status": "success",
  "num_tokens": 100,
  "num_retrieved": 80
}

GET /cache/kvcache/check — KVCache Checksum#

Compute MD5 checksums for kvcaches at specified slot_mapping positions. Used for verifying that stored and retrieved kvcaches are identical.

  • Method: GET

  • Path: /cache/kvcache/check

  • Parameters:

    Name

    Type

    Description

    slot_mapping

    str

    Slot indices, comma-separated. Supports ranges: "0,1,2,3" or "1,2,3,[9,12],17,19"

    chunk_size

    int

    Chunk size for computing per-chunk checksums (required)

    layerwise

    bool

    If true, output per-layer checksums per chunk (default: false)

  • Response: application/json

# Per-chunk checksum (all layers combined)
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2"

# Per-layer per-chunk checksum
curl "http://localhost:7000/cache/kvcache/check?slot_mapping=0,1,2,3&chunk_size=2&layerwise=true"

Example Response (layerwise=false):

{
  "status": "success",
  "slot_mapping_ranges": [[0, 3]],
  "chunk_size": 2,
  "num_chunks": 2,
  "chunk_checksums": ["abc123...", "def456..."],
  "layerwise": false
}

Example Response (layerwise=true):

{
  "status": "success",
  "slot_mapping_ranges": [[0, 3]],
  "chunk_size": 2,
  "num_chunks": 2,
  "chunk_checksums": {
    "layer_0": ["abc123...", "def456..."],
    "layer_1": ["ghi789...", "jkl012..."]
  },
  "layerwise": true
}

POST /cache/kvcache/record_slot — Toggle Slot Logging#

Enable or disable KVCache slot_mapping logging during store/retrieve operations.

  • Method: POST

  • Path: /cache/kvcache/record_slot

  • Parameters:

    Name

    Type

    Description

    enabled

    str

    "true" to enable, "false" to disable. Omit to query current status.

  • Response: application/json

# Enable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=true"

# Disable logging
curl -X POST "http://localhost:7000/cache/kvcache/record_slot?enabled=false"

# Check current status
curl -X POST http://localhost:7000/cache/kvcache/record_slot

Example Response:

{
  "status": "success",
  "kvcache_check_log_enabled": true
}

GET /cache/kvcache/info — KVCache Information#

Get information about the current kvcaches structure including layer names, shapes, and device info.

  • Method: GET

  • Path: /cache/kvcache/info

  • Parameters: None

  • Response: application/json

curl http://localhost:7000/cache/kvcache/info

Example Response:

{
  "status": "success",
  "num_layers": 32,
  "layers": {
    "layer_0": {
      "shape": [2, 128, 16, 64, 128],
      "dtype": "torch.bfloat16",
      "device": "cuda:0"
    }
  }
}

POST /cache/load-fs-chunks — Load FS Chunks#

Load chunk files from FSConnector storage into LocalCPUBackend’s hot cache.

  • Method: POST

  • Path: /cache/load-fs-chunks

  • Content-Type: application/json

  • Request Body:

    Field

    Type

    Description

    config_path

    str

    Path to LMCache engine configuration YAML file (required)

    max_chunks

    int

    (Optional) Maximum number of chunks to load

    max_failed_keys

    int

    Maximum failed keys to report (default: 10)

  • Response: application/json

  • Tags: cache-management

curl -X POST http://localhost:7000/cache/load-fs-chunks \
  -H "Content-Type: application/json" \
  -d '{"config_path": "/path/to/lmcache.yaml", "max_chunks": 100}'

Example Response (HTTP 200):

{
  "status": "success",
  "loaded_chunks": 95,
  "total_files": 100,
  "failed_keys": ["key1", "key2"],
  "config_path": "/path/to/lmcache.yaml"
}

Error Response (invalid config, HTTP 400):

{
  "error": "Failed to load chunks from FSConnector",
  "message": "Configuration file not found",
  "config_path": "/path/to/lmcache.yaml"
}

Freeze Mode#

PUT /freeze/enable — Enable Freeze Mode#

Enable freeze mode for the LMCache engine. When enabled:

  • All store operations will be skipped (no new data stored)

  • Only local_cpu backend will be used for retrieval

  • No admit/evict messages will be generated

This protects the local_cpu hot cache from changes.

  • Method: PUT

  • Path: /freeze/enable

  • Parameters: None

  • Response: application/json

curl -X PUT http://localhost:7000/freeze/enable

Example Response:

{
  "status": "success",
  "freeze": true,
  "message": "Freeze mode enabled successfully"
}

PUT /freeze/disable — Disable Freeze Mode#

Disable freeze mode. Store operations will proceed normally.

  • Method: PUT

  • Path: /freeze/disable

  • Parameters: None

  • Response: application/json

curl -X PUT http://localhost:7000/freeze/disable

Example Response:

{
  "status": "success",
  "freeze": false,
  "message": "Freeze mode disabled successfully"
}

GET /freeze/status — Freeze Status#

Get the current freeze mode status.

  • Method: GET

  • Path: /freeze/status

  • Parameters: None

  • Response: application/json

curl http://localhost:7000/freeze/status

Example Response:

{
  "status": "success",
  "freeze": true,
  "message": "Freeze mode is enabled"
}

Chunk Statistics#

These endpoints manage chunk-level statistics collection via ChunkStatisticsLookupClient. They are only available when the lookup client supports statistics.

POST /chunk_statistics/start — Start Statistics#

Start collecting chunk statistics.

  • Method: POST

  • Path: /chunk_statistics/start

  • Parameters: None

  • Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/start

Example Response:

{
  "status": "success",
  "message": "Started"
}

Error Response (not supported, HTTP 400):

{
  "error": "Not available",
  "message": "Client does not support statistics."
}

POST /chunk_statistics/stop — Stop Statistics#

Stop collecting chunk statistics.

  • Method: POST

  • Path: /chunk_statistics/stop

  • Parameters: None

  • Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/stop

Example Response:

{
  "status": "success",
  "message": "Stopped"
}

POST /chunk_statistics/reset — Reset Statistics#

Reset all collected chunk statistics.

  • Method: POST

  • Path: /chunk_statistics/reset

  • Parameters: None

  • Response: application/json

curl -X POST http://localhost:7000/chunk_statistics/reset

Example Response:

{
  "status": "success",
  "message": "Reset"
}

GET /chunk_statistics/status — Statistics Status#

Get current chunk statistics and auto-exit configuration.

  • Method: GET

  • Path: /chunk_statistics/status

  • Parameters: None

  • Response: application/json

curl http://localhost:7000/chunk_statistics/status

Example Response:

{
  "is_collecting": true,
  "total_chunks": 1000,
  "unique_chunks": 500,
  "timestamp": 1706745600.0,
  "auto_exit_enabled": true,
  "auto_exit_timeout_hours": 24.0,
  "auto_exit_target_unique_chunks": 1000
}