KV Cache Events
KV cache events are actions or lifecycle events that occur when managing the KV cache during inference. These events can be used for KV-cache-aware routing.
LMCache supports KV cache events as follows:
It generates KV cache events for store operations.
The event format follows the BlockStored class in vLLM.
LMCache passes the events to vLLM, which publishes them through its messaging system.
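As an illustration of how published events could feed KV-cache-aware routing, the following minimal sketch keeps an index of which serving instance has stored which block hashes and picks the instance holding the longest cached prefix of a request. The names `StoredEvent`, `CacheIndex`, and the instance labels are hypothetical; only the `block_hashes`, `parent_block_hash`, and `medium` fields are taken from the BlockStored event. The sketch also assumes the router computes request hashes with the same algorithm as the workers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class StoredEvent:
    """Hypothetical mirror of the BlockStored fields a router might use."""
    block_hashes: List[bytes]
    parent_block_hash: Optional[bytes]
    medium: Optional[str]


class CacheIndex:
    """Maps each block hash to the set of instances that reported storing it."""

    def __init__(self) -> None:
        self._holders: Dict[bytes, Set[str]] = defaultdict(set)

    def record(self, instance: str, event: StoredEvent) -> None:
        # Sets make this idempotent, so a duplicated event changes nothing.
        for block_hash in event.block_hashes:
            self._holders[block_hash].add(instance)

    def best_instance(self, request_hashes: List[bytes]) -> Optional[str]:
        """Return an instance holding the longest cached prefix, if any."""
        candidates: Optional[Set[str]] = None
        best: Optional[str] = None
        for block_hash in request_hashes:
            holders = self._holders.get(block_hash, set())
            candidates = set(holders) if candidates is None else candidates & holders
            if not candidates:
                break  # no instance holds the prefix up to this block
            best = min(candidates)  # deterministic tie-break by name
        return best


index = CacheIndex()
index.record("vllm-0", StoredEvent([b"h1", b"h2"], None, "cpu"))
index.record("vllm-1", StoredEvent([b"h1"], None, "cpu"))
print(index.best_instance([b"h1", b"h2", b"h3"]))  # vllm-0
```

A production router would also need to handle eviction and instance failure, which this sketch leaves out.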
Prerequisites
The following prerequisites are required:
vLLM v0.13.0+ (this version is currently unreleased, so you can use a vLLM nightly build instead)
LMCache v0.3.10post2+
How to Generate KV Cache Events
Before starting to generate KV events, you need to be aware of the following:
You need to set enable_kv_events to true in LMCache, as events are not generated by default.
If you run more than one worker in vLLM, you need to use a non-default hashing algorithm (set pre_caching_hash_algorithm in LMCache) so that every worker generates the same hashes. Otherwise, you will get duplicate events for the same operation, because events are generated per worker.
LMCache sends the events to vLLM for publishing. To enable events to be published, you need to set the vLLM configuration setting --kv-events-config. See vLLM KV Events configuration for more details.
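The point about hashing can be made concrete with a short sketch. Python's built-in hash() of bytes is salted per process unless PYTHONHASHSEED is fixed, so any scheme built on it can give different values in different workers, whereas a content hash such as SHA-256 gives the same value everywhere. The helper `chunk_hashes` below is hypothetical and is not LMCache's implementation: sha256_cbor_64bit is assumed to serialize its input differently, and a fixed-width byte encoding stands in for that here.

```python
import hashlib
import struct
from typing import List, Optional, Sequence


def chunk_hashes(token_ids: Sequence[int], chunk_size: int) -> List[bytes]:
    """Return one chained SHA-256 digest per chunk of token IDs."""
    hashes: List[bytes] = []
    parent: Optional[bytes] = None
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        h = hashlib.sha256()
        # Chain in the parent digest so each hash depends on the whole prefix.
        h.update(parent if parent is not None else b"\x00" * 32)
        # Fixed-width big-endian encoding keeps the byte stream unambiguous.
        h.update(struct.pack(f">{len(chunk)}Q", *chunk))
        parent = h.digest()
        hashes.append(parent)
    return hashes


tokens = [27, 91, 7265, 3575, 4326, 91, 1784, 91]
worker_a = chunk_hashes(tokens, chunk_size=4)
worker_b = chunk_hashes(tokens, chunk_size=4)
print(worker_a == worker_b)  # True
```

Because the digest depends only on the token content and the prefix, two workers that process the same tokens produce identical hashes, which lets a consumer recognise per-worker events as describing the same block.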
The steps that follow give an example of how KV events can be generated, published and consumed:
Start vLLM with LMCache and the model Qwen/Qwen3-0.6B as follows:
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
vllm serve Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--disable-log-requests --no-enable-prefix-caching --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
An example LMCache configuration (lmcache_config.yaml) is as follows:
chunk_size: 256
local_cpu: true
enable_kv_events: true
pre_caching_hash_algorithm: sha256_cbor_64bit
To process the events published by vLLM, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example of such a client, KV Events Subscriber. Run this Python script in a separate terminal.
Prompt the model:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
"max_tokens": 100,
"temperature": 0.7
}'
You should receive a message in the window of the client that you started in step 2, similar to the following:
Received event batch at 1765529395.2132685:
- BlockStored(block_hashes=[b'\x96\x95[h6\x1dE$v\x03\xe8\xf0\xc20\xcd\xe8\xa7#\x9cS\xe0\x16\xba\xab7\xf7z\x10P]\xfaT'], parent_block_hash=None, token_ids=[27, 91, 7265, 3575, 4326, 91, 1784, 91, 8948, 91, 397, 2610, 525, 264, 10950, 15235, 17847, 624, 27, 91, 872, 91, 397, 3838, 374, 279, 16158, 1685, 1370, 276, 5267, 27, 91, 77091, 91, 29], block_size=36, lora_id=None, medium='cpu')
This is the event generated after the cache store operation.
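For readers who want to see roughly what the subscriber client in the steps above does, here is a minimal sketch using pyzmq. It rests on assumptions that should be checked against vLLM's own example: that the publisher is reachable at tcp://localhost:5557, and that each message arrives as three frames (topic, big-endian sequence number, serialized payload). The function names `parse_frames` and `run_subscriber` are hypothetical, and the payload is deliberately left undecoded.

```python
from typing import List, Tuple


def parse_frames(frames: List[bytes]) -> Tuple[str, int, bytes]:
    """Split one multipart message into (topic, sequence number, payload).

    Assumes a three-frame layout; adjust if the publisher differs.
    """
    topic, seq_bytes, payload = frames
    return topic.decode("utf-8"), int.from_bytes(seq_bytes, "big"), payload


def run_subscriber(endpoint: str = "tcp://localhost:5557",
                   topic: str = "kv-events") -> None:
    """Print one line for every event batch received on the topic."""
    import zmq  # pyzmq; imported here so parse_frames works without it

    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.connect(endpoint)  # endpoint is an assumed default
    sock.setsockopt_string(zmq.SUBSCRIBE, topic)
    while True:
        name, seq, payload = parse_frames(sock.recv_multipart())
        # Decoding the payload into event objects is left to vLLM's example.
        print(f"topic={name} seq={seq} payload_bytes={len(payload)}")
```

Calling run_subscriber() in a separate terminal starts the loop. The topic must match the "topic" value passed in --kv-events-config, which is "kv-events" in the command above.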