GlmMoeDsaForCausalLM#
A large Mixture-of-Experts model using Dynamic Sparse Attention (DSA), shared
by the GLM-5.2 series. Like DeepSeek-V4-Flash, the sparse-attention path
splits the model’s layers into more than one KV cache group; the
LMCacheMPConnector stores and retrieves each group in its own block size, so
KV reuse works without extra flags.
Validated models#
zai-org/GLM-5.2-FP8 (8 GPUs)
Engine documentation:
GLM-5.2 in vLLM supported models
(architecture GlmMoeDsaForCausalLM). See also the
vLLM GLM-5.2 recipe.
Status: Validated with LMCache (vLLM 0.23.0 + LMCache 0.4.7).
Start the LMCache MP server:
lmcache server \
--port 6555 \
--max-workers 8 \
--l1-size-gb 100 \
--eviction-policy LRU \
--chunk-size 1024
Start vLLM with the LMCache MP connector (8 GPUs):
vllm serve zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 8 \
--tool-call-parser glm47 \
--enable-auto-tool-choice \
--reasoning-parser glm45 \
--no-enable-prefix-caching \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both","kv_connector_extra_config":{"lmcache.mp.port":6555}}'
--tool-call-parser glm47, --enable-auto-tool-choice, and
--reasoning-parser glm45 are GLM-5.2’s serving requirements (see the
vLLM recipe). --no-enable-prefix-caching routes all KV reuse through
LMCache rather than vLLM’s in-engine prefix cache. The server’s
--port 6555 must match lmcache.mp.port in the connector config;
--max-workers is set to the tensor-parallel size. Adjust
--tensor-parallel-size to match your hardware. For the generic
LMCache + vLLM wiring (ports, remote hosts), see
Quickstart.
If there are any issues with vLLM setup, please refer to the vLLM Recipes for more details.
Status: Not validated with LMCache.
Status: Supported. See Quickstart for TRT-LLM + LMCache setup.
CacheBlend support#
Not validated.
Compression support#
Method |
Status |
Notes |
|---|---|---|
Not validated |
Caveats#
Dynamic Sparse Attention KV groups. GLM-5.2’s DSA path splits the model’s layers into more than one KV cache group with different block geometries. LMCache stores and retrieves each group in its own block size; no extra flags are required beyond the launch commands above.