Hybrid-Attention Models#
Some models interleave more than one attention type across their layers — most commonly sliding-window attention on most layers and full attention on a few. vLLM serves these with its hybrid KV cache manager, which splits the model’s layers into multiple KV cache groups (one per attention behavior).
The LMCache multiprocess connector (LMCacheMPConnector) supports these
hybrid models: it stores and retrieves the KV cache for every group, so prefix
caching and KV reuse work the same way they do for plain models.
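To make the idea of "one KV cache group per attention behavior" concrete, here is a minimal sketch that buckets a toy model's layers by attention type. `LayerSpec` and `group_layers` are hypothetical names invented for this illustration; they are not vLLM or LMCache APIs, and vLLM's real grouping logic is more involved.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of one attention layer (not a vLLM type)."""
    name: str
    sliding_window: Optional[int]  # None means full attention


def group_layers(layers):
    """Bucket layer names by attention behavior, keeping layer order."""
    groups = defaultdict(list)
    for layer in layers:
        if layer.sliding_window is None:
            key = "full"
        else:
            key = f"sliding_window={layer.sliding_window}"
        groups[key].append(layer.name)
    return dict(groups)


# Toy 6-layer model (an assumption for the example): every third layer
# uses full attention, the rest use a 4096-token sliding window.
layers = [
    LayerSpec(f"layers.{i}", None if i % 3 == 2 else 4096)
    for i in range(6)
]
groups = group_layers(layers)
for key, names in groups.items():
    print(key, names)
```

In this toy layout the layers fall into two groups, and a connector that supports hybrid models has to store and retrieve the KV cache for both.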
What Works#
Models whose layers all use standard paged attention — including hybrids that mix sliding-window and full attention — are supported with no special configuration. Examples:
| Model family | Attention layout | Status |
|---|---|---|
| Gemma 2 / Gemma 3 | Interleaved sliding-window + full | Supported |
| gpt-oss | Interleaved sliding-window + full | Supported |
| Llama, Qwen2/Qwen3 (dense), Mistral, … | Single attention type | Supported |
Just point vLLM at the LMCache server as usual (see Quick Start); LMCache detects the model’s KV cache groups automatically at registration time.
Note
Because LMCacheMPConnector advertises hybrid support to vLLM, vLLM keeps
its hybrid KV cache manager enabled for these models (it does not fall
back to a single unified group). You do not need
--no-disable-hybrid-kv-cache-manager or any related flag.
What Is Not Supported Yet#
Mamba / linear-attention hybrids (e.g. Qwen3-Next, Qwen3.5, and other Gated-DeltaNet models). The linear-attention layers in these models keep a recurrent state cache (a convolution + SSM state) instead of a paged key/value cache, which LMCache’s transfer path cannot represent today. Such models fail to register with the LMCache server; support for them is future work.
DeepSeek-V4-style compressed / indexer caches are likewise not yet handled by the multiprocess connector.
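The support rule in the two sections above (every layer must use a paged key/value cache) can be expressed as a small pre-flight check. This is only an illustrative sketch: the layer-type labels and the `unsupported_layers` helper are assumptions made up for the example, and the toy layouts are not taken from any real model config. LMCache performs its own check when the model registers.

```python
# Hypothetical layer-type labels chosen for this example; they are not
# identifiers from vLLM or LMCache.
PAGED_ATTENTION_TYPES = {"full_attention", "sliding_window"}


def unsupported_layers(layer_types):
    """Return (index, type) pairs for layers without a paged KV cache."""
    return [
        (index, layer_type)
        for index, layer_type in enumerate(layer_types)
        if layer_type not in PAGED_ATTENTION_TYPES
    ]


# Toy layouts (assumptions): a sliding-window/full hybrid and a
# linear-attention hybrid.
paged_hybrid = ["sliding_window", "sliding_window", "full_attention"] * 2
linear_hybrid = ["linear_attention"] * 3 + ["full_attention"]

print(unsupported_layers(paged_hybrid))   # nothing to report
print(unsupported_layers(linear_hybrid))  # the linear-attention layers
```

An empty result corresponds to the "What Works" case; any non-empty result corresponds to a model that would fail to register today.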
Verifying Correctness#
To convince yourself that a hybrid model’s KV is being cached and reused correctly, you can compare a cold run against a run served from LMCache:
1. Run an evaluation (e.g. lm_eval on gsm8k) against vLLM + LMCache. This computes the KV cache and stores it in LMCache.
2. Reset only vLLM’s local prefix cache, leaving the LMCache-managed cache intact (requires launching vLLM with VLLM_SERVER_DEV_MODE=1):

       curl -X POST http://localhost:8000/reset_prefix_cache

   Omit the reset_external=true query parameter so the LMCache cache is preserved.
3. Re-run the same evaluation. vLLM now misses in its local cache, so the prefix KV is retrieved from LMCache. The score should match the first run.
The project ships this as the hma_lm_eval continuous-integration test (see
.buildkite/k3_tests/multiprocess).
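If you want to script the manual procedure yourself, the reset and the score comparison might look roughly like the sketch below. The helper names (`reset_url`, `reset_local_prefix_cache`, `scores_match`) are hypothetical, the default base URL simply mirrors the curl example, and how you obtain the two scores from your evaluation tool is left to you. This is not the project's CI test.

```python
import urllib.request


def reset_url(base_url: str) -> str:
    """Build the reset endpoint URL.

    No reset_external parameter is appended, so the LMCache-managed
    cache is left untouched.
    """
    return base_url.rstrip("/") + "/reset_prefix_cache"


def reset_local_prefix_cache(base_url: str = "http://localhost:8000") -> int:
    """POST to the reset endpoint and return the HTTP status code.

    Assumes vLLM was launched with VLLM_SERVER_DEV_MODE=1.
    """
    request = urllib.request.Request(reset_url(base_url), method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


def scores_match(cold: float, warm: float, tolerance: float = 0.0) -> bool:
    """Compare the cold-run score with the score served from LMCache."""
    return abs(cold - warm) <= tolerance
```

A typical flow would be: run the evaluation and record the score, call `reset_local_prefix_cache()`, run the evaluation again, then check `scores_match(first_score, second_score)`. The default tolerance of zero reflects the expectation that the two scores are identical; loosen it only if your evaluation setup is itself non-deterministic.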
See Also#
Quick Start — launching the LMCache server and a vLLM client.
Design notes on how groups are detected and addressed: docs/design/integration/vllm/hybrid-kv-cache-groups.md in the source tree.