Qwen3_5ForConditionalGeneration#
A hybrid architecture interleaving Mamba / Gated-DeltaNet (GDN) linear-attention layers with full-attention layers. LMCache reinterprets the recurrent state caches as opaque pages at registration time; see Hybrid-Attention Models.
Validated models#
Engine documentation: Qwen3.5 in vLLM supported models (architecture Qwen3_5ForConditionalGeneration).
Status: Validated with LMCache.
Start the LMCache MP server. --chunk-size must be a multiple of vLLM’s unified block size for the model. vLLM logs "Setting attention block size to N tokens" at startup; for Qwen3.5-0.8B, N = 544:
lmcache server --chunk-size 544 --l1-size-gb 100 --eviction-policy LRU
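The multiple-of-block-size rule for --chunk-size can be checked before launching the server. The following is a minimal standalone sketch, not part of LMCache or vLLM: the helper name `is_valid_chunk_size` is hypothetical, and 544 is the block size vLLM reports for Qwen3.5-0.8B.

```python
def is_valid_chunk_size(chunk_size: int, block_size: int) -> bool:
    """Return True if chunk_size is a positive multiple of block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return chunk_size > 0 and chunk_size % block_size == 0


# Block size taken from the vLLM startup log for Qwen3.5-0.8B.
BLOCK_SIZE = 544

# Hypothetical candidate values for --chunk-size.
candidates = [256, 544, 1024, 1088, 1632]
valid = [c for c in candidates if is_valid_chunk_size(c, BLOCK_SIZE)]
print(valid)
```

Of the candidates listed, only 544, 1088, and 1632 pass, since they are 1, 2, and 3 times the block size. For another model, substitute the N that vLLM logs at startup.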
Qwen3.5-0.8B (1 GPU):
vllm serve Qwen/Qwen3.5-0.8B \
--enable-prefix-caching \
--mamba-cache-mode align \
--max-num-batched-tokens 544 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
--mamba-cache-mode align is required (GDN does not support the all mode). --max-num-batched-tokens must be at least the unified block size and less than twice it; LMCache raises an error at engine startup otherwise. The align mode snapshots the Mamba state only at the end of each scheduler step, so each prefill step must advance exactly one block for every block boundary to hold a reusable snapshot.
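The range constraint on --max-num-batched-tokens can be illustrated with a toy model. This is a sketch under stated assumptions, not vLLM's scheduler or LMCache's startup check: both function names are hypothetical, and the model assumes each prefill step advances by the largest whole number of blocks that fits in the token budget, with a snapshot taken only where a step ends.

```python
def batched_tokens_in_range(max_batched_tokens: int, block_size: int) -> bool:
    """True if block_size <= max_batched_tokens < 2 * block_size."""
    return block_size <= max_batched_tokens < 2 * block_size


def snapshot_positions(prompt_len: int, max_batched_tokens: int,
                       block_size: int) -> list[int]:
    """Token positions where a prefill step ends on a block boundary.

    Toy model (assumption): every step advances by as many whole blocks
    as fit in the budget, and state is snapshotted only at step ends.
    """
    budget_blocks = max_batched_tokens // block_size
    if budget_blocks == 0:
        raise ValueError("token budget is smaller than one block")
    positions = []
    pos = 0
    while True:
        remaining_blocks = (prompt_len - pos) // block_size
        advance = min(budget_blocks, remaining_blocks) * block_size
        if advance == 0:
            break  # only a partial block is left; no boundary to snapshot
        pos += advance
        positions.append(pos)
    return positions


BLOCK = 544        # unified block size for Qwen3.5-0.8B
PROMPT_LEN = 1700  # hypothetical prompt length
boundaries = list(range(BLOCK, PROMPT_LEN + 1, BLOCK))

for budget in (544, 1088):
    snaps = snapshot_positions(PROMPT_LEN, budget, BLOCK)
    missed = [b for b in boundaries if b not in snaps]
    print(budget, batched_tokens_in_range(budget, BLOCK), snaps, missed)
```

In this model, a budget of 544 ends a step at every block boundary (544, 1088, 1632), while a budget of 1088 skips the boundary at 544, so a prefix ending there would have no reusable snapshot.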
For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start.
Status: Not validated with LMCache.
Status: Not supported. LMCache TRT-LLM integration is in progress.
CacheBlend support#
Not supported: the hybrid groups’ cached pages are byte-opaque (see Caveats).
Compression support#
| Method | Status | Notes |
|---|---|---|
| | Not supported | Hybrid groups’ cached pages are byte-opaque. |
Caveats#
Generation is not bit-exact between a cached and a fresh run: GDN backends do not support vLLM’s batch-invariant mode. Expect score-level equivalence, not token-level (the CI gate is the hma_lm_eval_qwen3_5gsm8k store-vs-retrieve comparison).

Cached pages for the Mamba and full-attention groups are byte-opaque views, so content-aware processing does not apply, and cache entries must not be shared across engines with different attention backends or kernel block sizes.

vLLM’s Mamba prefix caching in align mode is experimental.
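One way to keep engines with different attention backends or kernel block sizes from sharing cache entries is to derive a separate cache namespace per layout. The sketch below is illustrative only and does not describe how LMCache keys its entries: the `EngineLayout` class, the `cache_namespace` function, and the backend strings are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineLayout:
    """Properties that determine the byte layout of opaque cached pages."""
    model: str
    attention_backend: str   # placeholder label, not a validated backend name
    kernel_block_size: int


def cache_namespace(layout: EngineLayout) -> str:
    """Derive a short, stable namespace string from the engine layout."""
    raw = f"{layout.model}|{layout.attention_backend}|{layout.kernel_block_size}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


a = EngineLayout("Qwen/Qwen3.5-0.8B", "backend-a", 544)
b = EngineLayout("Qwen/Qwen3.5-0.8B", "backend-b", 544)
c = EngineLayout("Qwen/Qwen3.5-0.8B", "backend-a", 272)

# Engines may share entries only when their namespaces match.
print(cache_namespace(a) == cache_namespace(a))
print(cache_namespace(a) == cache_namespace(b))
print(cache_namespace(a) == cache_namespace(c))
```

Because the pages are byte-opaque, a mismatch cannot be detected from the cached content itself, which is why the separation has to happen at the key or deployment level.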