Hybrid Attention Models#
Some models interleave more than one attention type across their layers — most commonly sliding-window attention on most layers and full attention on a few. vLLM serves these with its hybrid KV cache manager, which splits the model’s layers into multiple KV cache groups (one per attention behavior).
The LMCache multiprocess connector (LMCacheMPConnector) supports these
hybrid models: it stores and retrieves the KV cache for every group, so prefix
caching and KV reuse work the same way they do for plain models.
Validated hybrid models#
Recipe pages for the validated hybrid-attention architectures:
Model |
Attention layout |
Recipe |
|---|---|---|
Gemma 3 |
Sliding-window + full |
|
Gemma 4 |
Sliding-window + full |
|
gpt-oss |
Sliding-window + full |
|
Qwen3.5 / Qwen3.6 series |
Mamba / GDN + full |
|
DeepSeek-V4-Flash |
Sparse-MLA (multiple KV groups) |
|
GLM 5.1/5.2 |
Dynamic Sparse Attention (multiple KV groups) |
What Works#
Models whose layers all use standard paged attention — including hybrids that mix sliding-window and full attention — are supported with no special configuration. Examples:
Model family |
Attention layout |
Status |
|---|---|---|
Gemma 2 / Gemma 3 |
Interleaved sliding-window + full |
Supported |
gpt-oss |
Interleaved sliding-window + full |
Supported |
Qwen3.5 (and other Gated-DeltaNet hybrids) |
Interleaved Mamba/GDN + full |
Supported (see below) |
Llama, Qwen2/Qwen3 (dense), Mistral, … |
Single attention type |
Supported |
Just point vLLM at the LMCache server as usual (see Quickstart); LMCache detects the model’s KV cache groups automatically at registration time.
Note
Because LMCacheMPConnector advertises hybrid support to vLLM, vLLM keeps
its hybrid KV cache manager enabled for these models (it does not fall
back to a single unified group). You do not need
--no-disable-hybrid-kv-cache-manager or any related flag.
Mamba / Linear-Attention Hybrids#
Models that interleave Mamba / Gated-DeltaNet (GDN) linear-attention layers
with full attention — the Qwen3.5 and Qwen3.6 series (Qwen/Qwen3.5-0.8B,
Qwen/Qwen3.6-27B, …), Qwen3-Next, and other GDN hybrids — are supported.
Unlike a paged key/value cache, their linear-attention layers keep a recurrent
state cache (a convolution + SSM state). LMCache reinterprets that state as
an opaque page at registration time, so prefix caching and KV reuse work end to
end without any model-specific transfer code.
This section is the general procedure for any such model. The only
per-model variable is the unified block size N (step 1); everything else
is identical across models.
Step 1 — find the model’s unified block size N#
N is the single number that drives every other setting: the LMCache
server’s --chunk-size and vLLM’s --max-num-batched-tokens are both
derived from it (step 2). Get it wrong and LMCache raises at engine startup.
For a Mamba / GDN hybrid, vLLM forces one block size across all KV cache groups, chosen large enough that an attention page is at least as big as a Mamba state page. It depends on the model’s head dimensions and GDN state size, so it is model-specific — never assume a value, read it from the model. vLLM prints it once at startup:
INFO ... interface.py:670] Setting attention block size to 784 tokens to
ensure that attention page size is >= mamba page size.
You do not need LMCache, a full serving run, or the weights to be quantized to
read it — just launch vLLM until the line appears, then stop. The snippet below
does exactly that and prints N:
MODEL=Qwen/Qwen3.6-27B
LOG=$(mktemp)
# Launch vLLM just far enough to size the KV cache; cheap settings only.
vllm serve "$MODEL" \
--mamba-cache-mode align --enable-prefix-caching \
--max-model-len 8192 --gpu-memory-utilization 0.5 \
--port 8011 > "$LOG" 2>&1 &
VLLM_PID=$!
# Wait for the block-size line (or a fatal error), then stop vLLM.
until grep -qiE "Setting attention block size|Error|Traceback" "$LOG"; do
sleep 3
done
grep -i "Setting attention block size" "$LOG"
kill "$VLLM_PID"
The number in to N tokens is your N. Values grow with model size; for
example:
Model |
Unified block size |
GPUs |
|---|---|---|
|
784 |
1 |
|
544 |
1 |
Step 2 — derive the three required flags from N#
LMCache server
--chunk-size= N (or any multiple ofN). This is the rule the connector enforces: LMCache’s chunk size must be a multiple of vLLM’s unified block size, or registration fails:lmcache server --chunk-size 784 --l1-size-gb 100 --eviction-policy LRUvLLM
--max-num-batched-tokensin [N, 2·N) — setting it equal toNis the simple, always-valid choice. Outside this range LMCache raises at engine startup.alignmode snapshots the Mamba state only at the end of each scheduler step, so each prefill step must advance exactly one block; a larger budget would let a step skip block boundaries, leaving no snapshot for LMCache to store at those prefixes.vLLM
--mamba-cache-mode align --enable-prefix-caching—alignis mandatory (GDN backends do not support theallmode):vllm serve <model> \ --enable-prefix-caching --mamba-cache-mode align \ --max-num-batched-tokens 784 \ --kv-transfer-config \ '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
So for a freshly-probed model the whole derivation is just: read N (step 1),
then pass --chunk-size N to the server and --max-num-batched-tokens N to
vLLM.
No --no-disable-hybrid-kv-cache-manager or attention-backend flag is needed;
LMCacheMPConnector advertises hybrid support and vLLM auto-selects the GDN
backend.
Caveats#
Generation is not bit-exact between a cached and a fresh run: GDN backends do not support vLLM’s batch-invariant mode. Validate with a score-level comparison (see Verifying Correctness), not a token-level diff.
The cached pages are byte-opaque, so content-aware features (CacheGen compression, CacheBlend) do not apply, and cache entries must not be shared across engines with different attention backends or kernel block sizes.
Several of these models are vision-language (they load a vision tower). The validated, supported path is text KV caching; image/video KV caching is not validated.
vLLM’s Mamba prefix caching in
alignmode is marked experimental upstream.
See the Qwen3.5 / Qwen3.6 recipe for the validated end-to-end commands and the per-model block sizes.
What Is Not Supported Yet#
DeepSeek-V4-style compressed / indexer caches are not yet handled by the multiprocess connector.
Verifying Correctness#
To convince yourself that a hybrid model’s KV is being cached and reused correctly, you can compare a cold run against a run served from LMCache:
Run an evaluation (e.g.
lm_evalongsm8k) against vLLM + LMCache. This computes the KV cache and stores it in LMCache.Reset only vLLM’s local prefix cache, leaving the LMCache-managed cache intact (requires launching vLLM with
VLLM_SERVER_DEV_MODE=1):curl -X POST http://localhost:8000/reset_prefix_cacheOmit the
reset_external=truequery parameter so the LMCache cache is preserved.Re-run the same evaluation. vLLM now misses in its local cache, so the prefix KV is retrieved from LMCache. The score should match the first run.
The project ships this as the hma_lm_eval continuous-integration test (see
.buildkite/k3_tests/multiprocess).
See Also#
Quickstart — launching the LMCache server and a vLLM client.
Design notes on how groups are detected and addressed:
docs/design/integration/vllm/hybrid-kv-cache-groups.mdin the source tree.