KV Cache Events#
KV cache events are lifecycle notifications emitted while the KV cache is managed during inference. These events can be used for KV-cache-aware routing.
LMCache supports KV cache events as follows:

- LMCache generates KV cache events for storage operations.
- The event format follows the `BlockStored` class in vLLM.
- LMCache passes the events to SGLang or vLLM, which publish them through their own messaging systems.
Prerequisites#
The following prerequisites are required.

For vLLM:

- vLLM v0.13.0+
- LMCache v0.3.11+

For SGLang:

- SGLang vx.y.z+
- LMCache vx.y.z+
How to Generate KV Cache Events#
Before starting to generate KV events, you need to be aware of the following:

- You need to enable `enable_kv_events` for LMCache, as events are not generated by default.
- If you run more than one worker in vLLM, you need to use a non-default hashing algorithm (set `pre_caching_hash_algorithm` in LMCache) so that the hashes generated by each worker are identical. Otherwise, because events are generated per worker, you will get duplicate events for the same operation.
- LMCache sends the events to vLLM for publishing. To enable publishing, set the vLLM configuration option `--kv-events-config`. See the vLLM KV Events configuration for more details.
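The value passed to `--kv-events-config` is a JSON object, so a quick way to catch quoting mistakes is to parse it before launching the server. The field names below are taken from the example command later in this section; note that the example encodes the boolean as the string `"True"`:

```python
import json

# The value handed to vLLM's --kv-events-config flag is a JSON string.
# Field names mirror the example command used later in this section.
raw = '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
cfg = json.loads(raw)

# Basic sanity checks before passing the string on the command line.
assert cfg["publisher"] == "zmq"
assert cfg["topic"] == "kv-events"
print(cfg)
```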
The steps that follow give an example of how KV events can be generated, published and consumed:
1. Start vLLM with LMCache and the `Qwen/Qwen3-0.6B` model as follows:

```bash
LMCACHE_CONFIG_FILE=lmcache_config.yaml \
vllm serve Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--disable-log-requests --no-enable-prefix-caching --kv-events-config '{"enable_kv_cache_events": "True", "publisher": "zmq", "topic": "kv-events"}'
```
An example LMCache configuration (`lmcache_config.yaml`) is as follows:

```yaml
chunk_size: 8 # demo only; use 256 for production
local_cpu: true
enable_kv_events: true
pre_caching_hash_algorithm: sha256_cbor_64bit
```
2. To process the events published by vLLM, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example client, the KV Events Subscriber. Run this Python script in a separate terminal.
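The `zmq` publisher delivers events as multipart messages on a topic, and the subscriber filters on that topic prefix. The following is a minimal in-process sketch of that subscribe/consume pattern; the JSON payload is a stand-in for illustration only (vLLM's real subscriber example handles the actual wire encoding):

```python
import json
import zmq  # pip install pyzmq

ctx = zmq.Context.instance()

# Stand-in publisher, playing the role of the vLLM server.
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://kv-events-demo")

# Subscriber, playing the role of the KV Events Subscriber client.
sub = ctx.socket(zmq.SUB)
sub.connect("inproc://kv-events-demo")
sub.setsockopt(zmq.SUBSCRIBE, b"kv-events")  # must match the configured topic

payload = json.dumps({"event": "BlockStored", "block_size": 36}).encode()

# PUB/SUB subscription joins are asynchronous ("slow joiner"), so
# resend until the first message actually arrives.
topic = body = None
for _ in range(100):
    pub.send_multipart([b"kv-events", payload])
    if sub.poll(timeout=50):
        topic, body = sub.recv_multipart()
        break

assert topic == b"kv-events"
print(json.loads(body))
```

In production the subscriber connects to the endpoint the server publishes on instead of an in-process socket, but the topic filtering and multipart receive loop are the same.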
3. Prompt the model:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
You should receive a message in the client window (the client you started in step 2), similar to the following:

```text
Received event batch at 1765529395.2132685:
- BlockStored(block_hashes=[b'\x96\x95[h6\x1dE$v\x03\xe8\xf0\xc20\xcd\xe8\xa7#\x9cS\xe0\x16\xba\xab7\xf7z\x10P]\xfaT'], parent_block_hash=None, token_ids=[27, 91, 7265, 3575, 4326, 91, 1784, 91, 8948, 91, 397, 2610, 525, 264, 10950, 15235, 17847, 624, 27, 91, 872, 91, 397, 3838, 374, 279, 16158, 1685, 1370, 276, 5267, 27, 91, 77091, 91, 29], block_size=36, lora_id=None, medium='cpu')
```
This is the event generated after the cache store operation.
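To make the fields concrete, the event above can be modeled as plain Python data. This is an illustrative reconstruction with placeholder values, not vLLM's actual class: `block_hashes` holds raw digest bytes (often displayed hex-encoded), `token_ids` covers exactly `block_size` tokens, and `medium` reflects where the block was stored (`cpu` here, matching `local_cpu: true`):

```python
# Hypothetical values mirroring the structure of the BlockStored line above.
event = {
    "block_hashes": [bytes(range(32))],  # placeholder digest bytes
    "parent_block_hash": None,           # the first block in a chain has no parent
    "token_ids": list(range(36)),        # placeholder token ids
    "block_size": 36,
    "medium": "cpu",                     # storage tier; matches local_cpu: true
}

# The stored block covers exactly block_size tokens.
assert len(event["token_ids"]) == event["block_size"]

# Digests are usually displayed hex-encoded for readability.
print(event["block_hashes"][0].hex())
```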
Before starting to generate KV events, you need to be aware of the following:

- You need to enable `enable_kv_events` for LMCache, as events are not generated by default.
- LMCache sends the events to SGLang for publishing. To enable publishing, set the SGLang configuration option `--kv-events-config`.
The steps that follow give an example of how KV events can be generated, published and consumed:
1. Start SGLang with LMCache and the `Qwen/Qwen3-0.6B` model as follows:

```bash
export LMCACHE_CONFIG_FILE=lmcache_config.yaml
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-0.6B \
    --enable-lmcache \
    --kv-events-config '{"publisher": "zmq", "topic": "kv-events"}'
```
An example LMCache configuration (`lmcache_config.yaml`) is as follows:

```yaml
chunk_size: 8 # demo only; use 256 for production
local_cpu: true
use_layerwise: true
max_local_cpu_size: 10 # GB
enable_kv_events: true
```
2. To process the events published by SGLang, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example client, the KV Events Subscriber. To use this client with SGLang, remove the properties `medium` and `lora_name` from the `BlockStored` class definition and `medium` from the `BlockRemoved` class definition. Save the changes and run the updated Python script in a separate terminal.

3. Prompt the model:
```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
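The subscriber changes described in step 2 amount to trimming two event classes. A sketch of what the trimmed definitions could look like, where field names and order are assumptions based on the sample output in this section:

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of the trimmed event classes for use with SGLang:
# `medium` and `lora_name` removed from BlockStored, and
# `medium` removed from BlockRemoved. Field names and order
# are assumptions based on the sample output in this section.

@dataclass
class BlockStored:
    block_hashes: list[int]
    parent_block_hash: Optional[int]
    token_ids: list[int]
    block_size: int
    lora_id: Optional[int]

@dataclass
class BlockRemoved:
    block_hashes: list[int]

# Reconstruct the first event from the sample output in this section.
first = BlockStored(
    block_hashes=[-7651984371600085018],
    parent_block_hash=None,
    token_ids=[151644, 872, 198, 48, 16948, 18, 374, 279, 5535],
    block_size=8,
    lora_id=None,
)
print(first.block_size)
```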
You should receive a message in the client window (the client you started in step 2), similar to the following:

```text
Received event batch at 1769014811.9058058:
- BlockStored(block_hashes=[-7651984371600085018], parent_block_hash=None, token_ids=[151644, 872, 198, 48, 16948, 18, 374, 279, 5535], block_size=8, lora_id=None)
- BlockStored(block_hashes=[1717827842932260036], parent_block_hash=-7651984371600085018, token_ids=[5535, 9471, 315, 3460, 4128, 4119, 304, 1207, 16948], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-6563676647234339623], parent_block_hash=1717827842932260036, token_ids=[16948, 4013, 11, 10004, 264, 15817, 16182, 315, 27950], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-5164197595219155465], parent_block_hash=-6563676647234339623, token_ids=[27950, 323, 20980, 8668, 18376, 15546, 151645, 198, 151644], block_size=8, lora_id=None)
- BlockStored(block_hashes=[8690007828157426740], parent_block_hash=-5164197595219155465, token_ids=[151644, 77091, 198, 151667, 198, 32313, 11, 279, 1196], block_size=8, lora_id=None)
- BlockStored(block_hashes=[5720773965762948853], parent_block_hash=8690007828157426740, token_ids=[1196, 9733, 1207, 16948, 18, 438, 279, 5535, 9471], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-4465594513801548703], parent_block_hash=5720773965762948853, token_ids=[9471, 315, 3460, 4128, 4119, 304, 279, 1207, 16948], block_size=8, lora_id=None)
- BlockStored(block_hashes=[4010782427232237897], parent_block_hash=-4465594513801548703, token_ids=[16948, 4013, 323, 429, 432, 5707, 264, 15817, 16182], block_size=8, lora_id=None)
- BlockStored(block_hashes=[8472258105533326837], parent_block_hash=4010782427232237897, token_ids=[16182, 315, 27950, 323, 20980, 8668, 18376, 15546, 4119], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-3602322156693524155], parent_block_hash=8472258105533326837, token_ids=[4119, 13, 6771, 752, 1191, 553, 48996, 279, 1207], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-6413316389463734553], parent_block_hash=-3602322156693524155, token_ids=[1207, 16948, 4013, 13, 1207, 16948, 374, 264, 4013], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-4080340760183068020], parent_block_hash=-6413316389463734553, token_ids=[4013, 315, 15235, 4119, 7881, 553, 54364, 13, 576], block_size=8, lora_id=None)
- BlockStored(block_hashes=[1557368444906237766], parent_block_hash=-4080340760183068020, token_ids=[576, 5535, 825, 11, 1207, 16948, 18, 11, 374], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-2282733302929094006], parent_block_hash=1557368444906237766, token_ids=[374, 12824, 279, 5535, 11, 773, 429, 594, 4396], block_size=8, lora_id=None)
- BlockStored(block_hashes=[8695562889830890067], parent_block_hash=-2282733302929094006, token_ids=[4396, 382, 7039, 11, 279, 1196, 6801, 311, 1414], block_size=8, lora_id=None)
- BlockStored(block_hashes=[-6034740625096789744], parent_block_hash=8695562889830890067, token_ids=[1414, 911, 279, 15817, 16182, 315, 27950, 323, 20980], block_size=8, lora_id=None)
```
This is the event generated after the cache store operation.
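Note how the events chain together: each `parent_block_hash` equals the previous event's block hash, which is what lets a consumer reconstruct cached prefixes for KV-cache-aware routing. A small sketch checking that property on the first few events above:

```python
# (block_hash, parent_block_hash) pairs taken from the first four
# BlockStored events in the sample output above.
events = [
    (-7651984371600085018, None),
    (1717827842932260036, -7651984371600085018),
    (-6563676647234339623, 1717827842932260036),
    (-5164197595219155465, -6563676647234339623),
]

# Walk the chain: every parent hash must match the preceding block hash.
prev = None
for block_hash, parent in events:
    assert parent == prev, "chain broken"
    prev = block_hash

print("chain verified over", len(events), "blocks")
```

In the sample output, consecutive events also overlap by one token (each `token_ids` list starts with the last token of the previous block), which is why 9 token ids appear per event despite `block_size=8`.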