KV Cache Events

KV cache events are actions or lifecycle events that occur when managing the KV cache during inference. These events can be used for KV-cache-aware routing.

LMCache supports KV cache events as follows:

  • LMCache generates events for KV cache storage operations

  • The event format follows the BlockStored class defined in vLLM

  • LMCache passes the events to vLLM, which publishes them through its messaging system
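
For illustration, the fields carried by a BlockStored event can be modeled roughly as follows. This is only a sketch based on the sample event shown later on this page; the authoritative definition is the BlockStored class in vLLM, which may differ:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BlockStored:
    # Illustrative sketch of the event fields; see vLLM for the real class.
    block_hashes: List[bytes]           # one hash per stored KV block
    parent_block_hash: Optional[bytes]  # None for the first block of a sequence
    token_ids: List[int]                # token IDs covered by the stored block(s)
    block_size: int                     # number of tokens per block
    lora_id: Optional[int]              # LoRA adapter ID, if any
    medium: Optional[str]               # storage medium, e.g. "cpu"
```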

Prerequisites

The following prerequisites are required:

  • vLLM v0.13.0+ (as this is currently unreleased, you can use a vLLM nightly build instead)

  • LMCache v0.3.10post2+

How to Generate KV Cache Events

Before starting to generate KV events, be aware of the following:

  • You need to set enable_kv_events: true in the LMCache configuration, as events are not generated by default.

  • If you run more than one worker in vLLM, you need to use a non-default hashing algorithm (set pre_caching_hash_algorithm in LMCache) so that every worker generates the same hashes. Events are generated per worker, so without a shared hashing algorithm you will see duplicate events for the same operation.

  • LMCache sends the events to vLLM for publishing. To enable publishing, set the vLLM option --kv-events-config. See the vLLM KV Events configuration for more details.
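
To see why a shared, deterministic hashing algorithm matters with multiple workers, consider the sketch below. It simulates two workers hashing the same token block: with a content-based hash such as SHA-256, both produce identical hashes, so duplicate events for the same block can be collapsed by hash. The function here is illustrative only and is not LMCache's actual hashing scheme:

```python
import hashlib
from typing import List

def block_hash(token_ids: List[int]) -> bytes:
    """Illustrative content-based block hash (not LMCache's exact scheme)."""
    data = b",".join(str(t).encode() for t in token_ids)
    return hashlib.sha256(data).digest()

# Two vLLM workers store the same block of tokens and each emits an event.
tokens = [27, 91, 7265, 3575]
event_from_worker0 = block_hash(tokens)
event_from_worker1 = block_hash(tokens)

# Because the hash depends only on content, both workers produce the same
# hash, and a consumer can deduplicate the two events with a simple set.
seen = {event_from_worker0, event_from_worker1}
assert len(seen) == 1
```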

The following steps show an example of how KV events can be generated, published, and consumed:

  1. Start vLLM with LMCache and model Qwen/Qwen3-0.6B as follows:

LMCACHE_CONFIG_FILE=lmcache_config.yaml \
    vllm serve Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --disable-log-requests --no-enable-prefix-caching --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'

An example LMCache configuration (lmcache_config.yaml) is as follows:

chunk_size: 256
local_cpu: true
enable_kv_events: true
pre_caching_hash_algorithm: sha256_cbor_64bit
  2. To process the events that vLLM publishes, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example client, the KV Events Subscriber. Run this Python script in a separate terminal.

  3. Prompt the model:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
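
The same request can also be built with Python's standard library. The sketch below only constructs the request so it runs without a live server; with the server from step 1 running, pass the request to urllib.request.urlopen to send it:

```python
import json
import urllib.request

def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the /v1/completions request equivalent to the curl call above."""
    payload = {
        "model": "Qwen/Qwen3-0.6B",
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("What is the capital of France?")
# With the server running:
#   response = urllib.request.urlopen(req)
```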
  4. You should receive a message in the client window (the client you started in step 2), similar to the following:

  Received event batch at 1765529395.2132685:
- BlockStored(block_hashes=[b'\x96\x95[h6\x1dE$v\x03\xe8\xf0\xc20\xcd\xe8\xa7#\x9cS\xe0\x16\xba\xab7\xf7z\x10P]\xfaT'], parent_block_hash=None, token_ids=[27, 91, 7265, 3575, 4326, 91, 1784, 91, 8948, 91, 397, 2610, 525, 264, 10950, 15235, 17847, 624, 27, 91, 872, 91, 397, 3838, 374, 279, 16158, 1685, 1370, 276, 5267, 27, 91, 77091, 91, 29], block_size=36, lora_id=None, medium='cpu')

This is the event generated after the cache store operation.
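
As noted at the top of this page, these events can drive KV-cache-aware routing. Below is a minimal sketch of a consumer that builds a routing index from BlockStored events; all names here (instance labels, function names) are illustrative and not part of the vLLM or LMCache APIs:

```python
from collections import defaultdict
from typing import Dict, List, Set

# Illustrative routing index: block hash -> serving instances that hold the block.
routing_index: Dict[bytes, Set[str]] = defaultdict(set)

def on_block_stored(instance: str, block_hashes: List[bytes]) -> None:
    """Record which instance now holds which KV blocks."""
    for h in block_hashes:
        routing_index[h].add(instance)

def instances_with_block(block_hash: bytes) -> Set[str]:
    """Candidate instances for KV-cache-aware routing of a matching prefix."""
    return routing_index.get(block_hash, set())

# Feed in a stored-block event like the sample output above (hash shortened).
sample_hash = b"\x96\x95[h6\x1dE$"
on_block_stored("vllm-instance-0", [sample_hash])
assert "vllm-instance-0" in instances_with_block(sample_hash)
```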