KV Cache Events#

KV cache events are actions or lifecycle events that occur when managing the KV cache during inference. These events can be used for KV-cache-aware routing.

LMCache supports KV cache events as follows:

  • LMCache generates KV cache storage events.

  • The event format follows the BlockStored class in vLLM.

  • LMCache passes the events to SGLang or vLLM, which publish them through their messaging system.
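As a rough sketch of what a storage event carries, the dataclass below mirrors the fields visible in the published events later on this page (block_hashes, parent_block_hash, token_ids, block_size, lora_id, medium). It is illustrative only; the authoritative definition is the BlockStored class in vLLM.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class BlockStoredEvent:
    """Illustrative mirror of a BlockStored event payload (not vLLM's
    actual class): field names match those printed by the example
    subscriber further down this page."""
    block_hashes: List[Any]           # content hash(es) of the stored block(s)
    parent_block_hash: Optional[Any]  # hash of the preceding block; None for the first
    token_ids: List[int]              # token ids covered by this block
    block_size: int                   # tokens per block
    lora_id: Optional[int] = None
    medium: Optional[str] = None      # e.g. "cpu" when LMCache stores to CPU memory

# Example: the first block of a prompt has no parent
# (values taken from the SGLang sample output below).
first = BlockStoredEvent(
    block_hashes=[-7651984371600085018],
    parent_block_hash=None,
    token_ids=[151644, 872, 198, 48, 16948, 18, 374, 279, 5535],
    block_size=8,
)
```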

Prerequisites#

The following prerequisites are required.

For vLLM:

  • vLLM v0.13.0+

  • LMCache v0.3.11+

For SGLang:

  • SGLang vx.y.z+

  • LMCache vx.y.z+

How to Generate KV Cache Events#

vLLM#

Before generating KV events with vLLM, be aware of the following:

  • You must set enable_kv_events: true in the LMCache configuration; events are not generated by default.

  • If you run more than one vLLM worker, use a non-default hashing algorithm (set pre_caching_hash_algorithm in the LMCache configuration) so that all workers generate the same hash for the same block. Otherwise, because events are generated per worker, you will receive duplicate events for the same operation.

  • LMCache sends the events to vLLM for publishing. To enable events to be published, set the vLLM --kv-events-config option. See vLLM KV Events configuration for more details.
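To see why a content-based hash keeps workers in agreement, here is a simplified, hypothetical sketch (not LMCache's actual sha256_cbor_64bit implementation, which uses CBOR encoding): because the hash depends only on the block's tokens and its parent hash, every worker computes the same value for the same block, so no duplicate identities arise.

```python
import hashlib
import struct
from typing import List, Optional

def block_hash(parent: Optional[int], token_ids: List[int]) -> int:
    """Deterministic 64-bit content hash of a KV cache block.

    Simplified illustration only: pack the parent hash and token ids
    into bytes, then truncate a SHA-256 digest to 64 bits. Any worker
    hashing the same (parent, tokens) pair gets the same value.
    """
    h = hashlib.sha256()
    h.update(struct.pack(">q", parent if parent is not None else 0))
    for t in token_ids:
        h.update(struct.pack(">q", t))
    return int.from_bytes(h.digest()[:8], "big", signed=True)

# Two workers hashing the same block independently agree:
tokens = [151644, 872, 198, 48, 16948, 18, 374, 279]
assert block_hash(None, tokens) == block_hash(None, tokens)
```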

The steps that follow give an example of how KV events can be generated, published and consumed:

  1. Start vLLM with LMCache and model Qwen/Qwen3-0.6B as follows:

LMCACHE_CONFIG_FILE=lmcache_config.yaml \
    vllm serve Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --disable-log-requests --no-enable-prefix-caching --kv-events-config '{"enable_kv_cache_events": true, "publisher": "zmq", "topic": "kv-events"}'

An example of the LMCache configuration (lmcache_config.yaml) is as follows:

chunk_size: 8  # demo only; use 256 for production
local_cpu: true
enable_kv_events: true
pre_caching_hash_algorithm: sha256_cbor_64bit
  2. To process the events published by vLLM, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example client, KV Events Subscriber. Run this Python script in a separate terminal.

  3. Prompt the model:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "<|begin_of_text|><|system|>\nYou are a helpful AI assistant.\n<|user|>\nWhat is the capital of France?\n<|assistant|>",
    "max_tokens": 100,
    "temperature": 0.7
  }'
  4. You should receive a message in the client window (started in step 2), similar to the following:

  Received event batch at 1765529395.2132685:
- BlockStored(block_hashes=[b'\x96\x95[h6\x1dE$v\x03\xe8\xf0\xc20\xcd\xe8\xa7#\x9cS\xe0\x16\xba\xab7\xf7z\x10P]\xfaT'], parent_block_hash=None, token_ids=[27, 91, 7265, 3575, 4326, 91, 1784, 91, 8948, 91, 397, 2610, 525, 264, 10950, 15235, 17847, 624, 27, 91, 872, 91, 397, 3838, 374, 279, 16158, 1685, 1370, 276, 5267, 27, 91, 77091, 91, 29], block_size=36, lora_id=None, medium='cpu')

This is the event generated after the cache store operation.
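Two quick sanity checks on the event above, with values copied verbatim from the output (they hold for this particular event; the exact hash width depends on the configured hashing algorithm):

```python
# Values copied from the BlockStored event printed above.
block_hash = b'\x96\x95[h6\x1dE$v\x03\xe8\xf0\xc20\xcd\xe8\xa7#\x9cS\xe0\x16\xba\xab7\xf7z\x10P]\xfaT'
token_ids = [27, 91, 7265, 3575, 4326, 91, 1784, 91, 8948, 91, 397, 2610,
             525, 264, 10950, 15235, 17847, 624, 27, 91, 872, 91, 397,
             3838, 374, 279, 16158, 1685, 1370, 276, 5267, 27, 91, 77091,
             91, 29]
block_size = 36

# The block hash is raw bytes, 32 bytes long in this run.
assert len(block_hash) == 32

# In this event the block covers exactly block_size tokens.
assert len(token_ids) == block_size
```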

SGLang#

Before generating KV events with SGLang, be aware of the following:

  • You must set enable_kv_events: true in the LMCache configuration; events are not generated by default.

  • LMCache sends the events to SGLang for publishing. To enable events to be published, set the SGLang --kv-events-config option.

The steps that follow give an example of how KV events can be generated, published and consumed:

  1. Start SGLang with LMCache and model Qwen/Qwen3-0.6B as follows:

export LMCACHE_CONFIG_FILE=lmcache_config.yaml

python -m sglang.launch_server \
--model-path Qwen/Qwen3-0.6B \
--enable-lmcache \
--kv-events-config '{"publisher": "zmq", "topic": "kv-events"}'

An example of the LMCache configuration (lmcache_config.yaml) is as follows:

chunk_size: 8  # demo only; use 256 for production
local_cpu: true
use_layerwise: true
max_local_cpu_size: 10  # GB
enable_kv_events: true
  2. To process the events published by SGLang, you need a client that subscribes to the publisher's message channel and consumes the events. vLLM provides an example client, KV Events Subscriber. To use this client with SGLang, remove the medium and lora_name properties from the BlockStored class definition and medium from the BlockRemoved class definition. Save the changes and run the updated Python script in a separate terminal.

  3. Prompt the model:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts"}],
  "max_tokens": 100,
  "temperature": 0.7
}'
  4. You should receive a message in the client window (started in step 2), similar to the following:

Received event batch at 1769014811.9058058:
  - BlockStored(block_hashes=[-7651984371600085018], parent_block_hash=None, token_ids=[151644, 872, 198, 48, 16948, 18, 374, 279, 5535], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[1717827842932260036], parent_block_hash=-7651984371600085018, token_ids=[5535, 9471, 315, 3460, 4128, 4119, 304, 1207, 16948], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-6563676647234339623], parent_block_hash=1717827842932260036, token_ids=[16948, 4013, 11, 10004, 264, 15817, 16182, 315, 27950], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-5164197595219155465], parent_block_hash=-6563676647234339623, token_ids=[27950, 323, 20980, 8668, 18376, 15546, 151645, 198, 151644], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[8690007828157426740], parent_block_hash=-5164197595219155465, token_ids=[151644, 77091, 198, 151667, 198, 32313, 11, 279, 1196], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[5720773965762948853], parent_block_hash=8690007828157426740, token_ids=[1196, 9733, 1207, 16948, 18, 438, 279, 5535, 9471], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-4465594513801548703], parent_block_hash=5720773965762948853, token_ids=[9471, 315, 3460, 4128, 4119, 304, 279, 1207, 16948], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[4010782427232237897], parent_block_hash=-4465594513801548703, token_ids=[16948, 4013, 323, 429, 432, 5707, 264, 15817, 16182], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[8472258105533326837], parent_block_hash=4010782427232237897, token_ids=[16182, 315, 27950, 323, 20980, 8668, 18376, 15546, 4119], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-3602322156693524155], parent_block_hash=8472258105533326837, token_ids=[4119, 13, 6771, 752, 1191, 553, 48996, 279, 1207], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-6413316389463734553], parent_block_hash=-3602322156693524155, token_ids=[1207, 16948, 4013, 13, 1207, 16948, 374, 264, 4013], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-4080340760183068020], parent_block_hash=-6413316389463734553, token_ids=[4013, 315, 15235, 4119, 7881, 553, 54364, 13, 576], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[1557368444906237766], parent_block_hash=-4080340760183068020, token_ids=[576, 5535, 825, 11, 1207, 16948, 18, 11, 374], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-2282733302929094006], parent_block_hash=1557368444906237766, token_ids=[374, 12824, 279, 5535, 11, 773, 429, 594, 4396], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[8695562889830890067], parent_block_hash=-2282733302929094006, token_ids=[4396, 382, 7039, 11, 279, 1196, 6801, 311, 1414], block_size=8, lora_id=None)
  - BlockStored(block_hashes=[-6034740625096789744], parent_block_hash=8695562889830890067, token_ids=[1414, 911, 279, 15817, 16182, 315, 27950, 323, 20980], block_size=8, lora_id=None)

These are the events generated after the cache store operations.
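The events above form a hash chain: each block's parent_block_hash equals the last hash of the block stored immediately before it, which is what lets a KV-cache-aware router reconstruct which prefixes a given instance holds. A small sketch that checks the chaining, using the first four hashes from the output above copied verbatim:

```python
from typing import Dict, List

def is_prefix_chain(events: List[Dict]) -> bool:
    """Return True if each event's parent_block_hash matches the last
    block hash of the preceding event (None for the first event)."""
    prev = None
    for e in events:
        if e["parent_block_hash"] != prev:
            return False
        prev = e["block_hashes"][-1]
    return True

# First four events from the sample output above.
sample = [
    {"block_hashes": [-7651984371600085018], "parent_block_hash": None},
    {"block_hashes": [1717827842932260036], "parent_block_hash": -7651984371600085018},
    {"block_hashes": [-6563676647234339623], "parent_block_hash": 1717827842932260036},
    {"block_hashes": [-5164197595219155465], "parent_block_hash": -6563676647234339623},
]
assert is_prefix_chain(sample)
```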