Hybrid Attention Models#

Some models interleave more than one attention type across their layers — most commonly sliding-window attention on most layers and full attention on a few. vLLM serves these with its hybrid KV cache manager, which splits the model’s layers into multiple KV cache groups (one per attention behavior).

The LMCache multiprocess connector (LMCacheMPConnector) supports these hybrid models: it stores and retrieves the KV cache for every group, so prefix caching and KV reuse work the same way they do for plain models.

Validated hybrid models#

Recipe pages for the validated hybrid-attention architectures:

| Model | Attention layout | Recipe |
|---|---|---|
| Gemma 3 | Sliding-window + full | Gemma 3 |
| Gemma 4 | Sliding-window + full | Gemma 4 |
| gpt-oss | Sliding-window + full | gpt-oss |
| Qwen3.5 / Qwen3.6 series | Mamba / GDN + full | Qwen3.5 / Qwen3.6 series |
| DeepSeek-V4-Flash | Sparse-MLA (multiple KV groups) | DeepSeek-V4-Flash |
| GLM 5.1/5.2 | Dynamic Sparse Attention (multiple KV groups) | GLM 5.1/5.2 |

What Works#

Models whose layers all use standard paged attention — including hybrids that mix sliding-window and full attention — are supported with no special configuration. Examples:

| Model family | Attention layout | Status |
|---|---|---|
| Gemma 2 / Gemma 3 | Interleaved sliding-window + full | Supported |
| gpt-oss | Interleaved sliding-window + full | Supported |
| Qwen3.5 (and other Gated-DeltaNet hybrids) | Interleaved Mamba/GDN + full | Supported (see below) |
| Llama, Qwen2/Qwen3 (dense), Mistral, … | Single attention type | Supported |

Just point vLLM at the LMCache server as usual (see Quickstart); LMCache detects the model’s KV cache groups automatically at registration time.

Note

Because LMCacheMPConnector advertises hybrid support to vLLM, vLLM keeps its hybrid KV cache manager enabled for these models (it does not fall back to a single unified group). You do not need --no-disable-hybrid-kv-cache-manager or any related flag.

Mamba / Linear-Attention Hybrids#

Models that interleave Mamba / Gated-DeltaNet (GDN) linear-attention layers with full attention — the Qwen3.5 and Qwen3.6 series (Qwen/Qwen3.5-0.8B, Qwen/Qwen3.6-27B, …), Qwen3-Next, and other GDN hybrids — are supported. Unlike a paged key/value cache, their linear-attention layers keep a recurrent state cache (a convolution + SSM state). LMCache reinterprets that state as an opaque page at registration time, so prefix caching and KV reuse work end to end without any model-specific transfer code.

This section describes the general procedure for any such model. The only per-model variable is the unified block size N (step 1); everything else is identical across models.

Step 1 — find the model’s unified block size N#

N is the single number that drives every other setting: the LMCache server’s --chunk-size and vLLM’s --max-num-batched-tokens are both derived from it (step 2). If it is wrong, LMCache raises an error at engine startup.

For a Mamba / GDN hybrid, vLLM forces one block size across all KV cache groups, chosen large enough that an attention page is at least as big as a Mamba state page. It depends on the model’s head dimensions and GDN state size, so it is model-specific — never assume a value, read it from the model. vLLM prints it once at startup:

INFO ... interface.py:670] Setting attention block size to 784 tokens to
ensure that attention page size is >= mamba page size.

You do not need LMCache, a full serving run, or quantized weights to read it: launch vLLM until the line appears, then stop. The snippet below does exactly that and prints the line containing N:

MODEL=Qwen/Qwen3.6-27B
LOG=$(mktemp)

# Launch vLLM just far enough to size the KV cache; cheap settings only.
vllm serve "$MODEL" \
    --mamba-cache-mode align --enable-prefix-caching \
    --max-model-len 8192 --gpu-memory-utilization 0.5 \
    --port 8011 > "$LOG" 2>&1 &
VLLM_PID=$!

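# Wait for the block-size line, a fatal error, or vLLM exiting early.
until grep -qiE "Setting attention block size|Error|Traceback" "$LOG"; do
    # Stop polling if vLLM has already died, so the loop cannot spin forever.
    kill -0 "$VLLM_PID" 2>/dev/null || break
    sleep 3
done
grep -i "Setting attention block size" "$LOG"
# Stop vLLM; ignore the error if it has already exited.
kill "$VLLM_PID" 2>/dev/null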

The number in the phrase “to N tokens” is your N. Values grow with model size; for example:

| Model | Unified block size N | GPUs |
|---|---|---|
| Qwen/Qwen3.6-27B | 784 | 1 |
| Qwen/Qwen3.5-0.8B | 544 | 1 |
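To reuse N in later commands, the startup line can be parsed into a shell variable. The following is a minimal sketch, assuming the log message keeps the wording quoted above; `extract_block_size` is a hypothetical helper name, not part of vLLM or LMCache:

```shell
# Hypothetical helper: read vLLM log text on stdin and print the unified
# block size N from the first "Setting attention block size to N tokens" line.
# Assumes the message wording matches the startup line quoted on this page.
extract_block_size() {
    sed -nE 's/.*Setting attention block size to ([0-9]+) tokens.*/\1/p' | head -n 1
}

# Example using the sample line from this page. With a real run, use:
#   N=$(extract_block_size < "$LOG")
SAMPLE='INFO ... interface.py:670] Setting attention block size to 784 tokens to'
N=$(printf '%s\n' "$SAMPLE" | extract_block_size)
echo "N=$N"
```

If the helper prints nothing, the line never appeared in the log, which usually means vLLM stopped before sizing the KV cache.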

Step 2 — derive the three required flags from N#

  1. LMCache server --chunk-size = N (or any multiple of N). This is the rule the connector enforces: LMCache’s chunk size must be a multiple of vLLM’s unified block size, or registration fails:

    lmcache server --chunk-size 784 --l1-size-gb 100 --eviction-policy LRU
    
  2. vLLM --max-num-batched-tokens in [N, 2·N); setting it equal to N is the simple, always-valid choice. Outside this range LMCache raises an error at engine startup. The reason is that align mode snapshots the Mamba state only at the end of each scheduler step, so each prefill step must advance exactly one block; a larger budget would let a step skip block boundaries, leaving no snapshot for LMCache to store at those prefixes.

  3. vLLM --mamba-cache-mode align --enable-prefix-caching. The align mode is mandatory (GDN backends do not support the all mode):

    vllm serve <model> \
        --enable-prefix-caching --mamba-cache-mode align \
        --max-num-batched-tokens 784 \
        --kv-transfer-config \
        '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
    

So for a freshly-probed model the whole derivation is just: read N (step 1), then pass --chunk-size N to the server and --max-num-batched-tokens N to vLLM.

No --no-disable-hybrid-kv-cache-manager or attention-backend flag is needed; LMCacheMPConnector advertises hybrid support and vLLM auto-selects the GDN backend.
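The two numeric rules from step 2 can be checked before launching anything. The following is a sketch of such a pre-flight check, not LMCache's own validation; `check_hybrid_flags` is a hypothetical helper that takes N, the intended --chunk-size, and the intended --max-num-batched-tokens:

```shell
# Hypothetical pre-flight check mirroring the rules in step 2:
#   chunk_size is a positive multiple of N, and N <= max_batched < 2*N.
# Usage: check_hybrid_flags N CHUNK_SIZE MAX_NUM_BATCHED_TOKENS
check_hybrid_flags() {
    local n=$1 chunk=$2 batched=$3
    if [ "$n" -le 0 ]; then
        echo "N must be positive" >&2
        return 1
    fi
    local upper=$(( 2 * n ))
    if [ "$chunk" -le 0 ] || [ $(( chunk % n )) -ne 0 ]; then
        echo "--chunk-size $chunk is not a positive multiple of N=$n" >&2
        return 1
    fi
    if [ "$batched" -lt "$n" ] || [ "$batched" -ge "$upper" ]; then
        echo "--max-num-batched-tokens $batched is outside [$n, $upper)" >&2
        return 1
    fi
    echo "ok: --chunk-size $chunk, --max-num-batched-tokens $batched"
}

# The simple, always-valid choice: both values equal to N.
check_hybrid_flags 784 784 784
```

A non-zero exit status from the helper corresponds to the cases in which, per the rules above, LMCache would refuse to start.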

Caveats#

  • Generation is not bit-exact between a cached and a fresh run: GDN backends do not support vLLM’s batch-invariant mode. Validate with a score-level comparison (see Verifying Correctness), not a token-level diff.

  • The cached pages are byte-opaque, so content-aware features (CacheGen compression, CacheBlend) do not apply, and cache entries must not be shared across engines with different attention backends or kernel block sizes.

  • Several of these models are vision-language (they load a vision tower). The validated, supported path is text KV caching; image/video KV caching is not validated.

  • vLLM’s Mamba prefix caching in align mode is marked experimental upstream.

See the Qwen3.5 / Qwen3.6 recipe for the validated end-to-end commands and the per-model block sizes.

What Is Not Supported Yet#

  • DeepSeek-V4-style compressed / indexer caches are not yet handled by the multiprocess connector.

Verifying Correctness#

To convince yourself that a hybrid model’s KV is being cached and reused correctly, you can compare a cold run against a run served from LMCache:

  1. Run an evaluation (e.g. lm_eval on gsm8k) against vLLM + LMCache. This computes the KV cache and stores it in LMCache.

  2. Reset only vLLM’s local prefix cache, leaving the LMCache-managed cache intact (requires launching vLLM with VLLM_SERVER_DEV_MODE=1):

    curl -X POST http://localhost:8000/reset_prefix_cache
    

    Omit the reset_external=true query parameter so the LMCache cache is preserved.

  3. Re-run the same evaluation. vLLM now misses in its local cache, so the prefix KV is retrieved from LMCache. The score should match the first run.

The project ships this as the hma_lm_eval continuous-integration test (see .buildkite/k3_tests/multiprocess).
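Because generation is not bit-exact for every model (see Caveats), the cold and cached scores are best compared within a tolerance rather than for strict equality. The sketch below wraps the reset call from step 2 and adds a hypothetical `scores_match` helper; the default tolerance of 0.01 and the example scores are assumptions for illustration, not project thresholds or measured results:

```shell
# Step 2 from the list above: reset only vLLM's local prefix cache.
# Requires vLLM to have been launched with VLLM_SERVER_DEV_MODE=1.
# Optional argument: base URL of the vLLM server.
reset_local_prefix_cache() {
    curl -sf -X POST "${1:-http://localhost:8000}/reset_prefix_cache"
}

# Hypothetical helper: succeed if two scores differ by at most a tolerance.
# Usage: scores_match COLD_SCORE CACHED_SCORE [TOLERANCE]
scores_match() {
    awk -v a="$1" -v b="$2" -v tol="${3:-0.01}" 'BEGIN {
        d = a - b
        if (d < 0) d = -d
        exit !(d <= tol)
    }'
}

# Example with illustrative scores; substitute the values from your two runs.
if scores_match 0.652 0.649; then
    echo "scores agree within tolerance"
else
    echo "scores diverge; investigate before trusting the cache" >&2
fi
```

In a real check, run the evaluation, call `reset_local_prefix_cache`, run the evaluation again, and pass the two reported scores to `scores_match`.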

See Also#

  • Quickstart — launching the LMCache server and a vLLM client.

  • Design notes on how groups are detected and addressed: docs/design/integration/vllm/hybrid-kv-cache-groups.md in the source tree.