# Qwen3MoeForCausalLM
## Validated models
Engine documentation: Qwen3 MoE in the vLLM supported models list (architecture `Qwen3MoeForCausalLM`).

Status: Validated with LMCache.
Start the LMCache MP server:

```bash
lmcache server --l1-size-gb 100 --eviction-policy LRU
```
Qwen3-235B-A22B (4 GPUs, expert parallel):

```bash
vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser qwen3 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
```
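The `--kv-transfer-config` flag takes a raw JSON string, so a stray quote or shell-escaping mistake is easy to make. A minimal sketch for sanity-checking the string before launching the server (plain Python stdlib; no vLLM or LMCache imports required):

```python
import json

# The connector config passed to vLLM via --kv-transfer-config.
kv_transfer_config = '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

# Parsing catches quoting and typo mistakes before the server starts.
cfg = json.loads(kv_transfer_config)
assert cfg["kv_connector"] == "LMCacheMPConnector"
assert cfg["kv_role"] == "kv_both"
print("kv-transfer-config OK:", cfg)
```

The same JSON string is used unchanged in every command on this page, so one check covers all of them.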
Qwen3-30B-A3B (1 GPU):

```bash
vllm serve Qwen/Qwen3-30B-A3B \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser qwen3 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
```
Qwen3-Coder-480B-A35B-Instruct-FP8 (8 GPUs, expert parallel):

```bash
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
```
Qwen3-Coder-30B-A3B-Instruct (1 GPU):

```bash
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
```
Adjust `--tensor-parallel-size` to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start. If you run into issues with the vLLM setup itself, refer to the vLLM Recipes for more details.
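Once a server from any of the commands above is up, you can smoke-test it through vLLM's OpenAI-compatible API. A minimal sketch using only the Python standard library; the base URL assumes vLLM's default port 8000, and the model name matches the Qwen3-30B-A3B command above (adjust both to your deployment):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> dict:
    # Minimal OpenAI-compatible chat payload for vLLM's /v1/chat/completions.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

def send_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    # POST the payload to a server started by one of the `vllm serve` commands.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen/Qwen3-30B-A3B", "Say hello.")
print(json.dumps(payload, indent=2))
# With a server running:
#   reply = send_chat(payload)
#   print(reply["choices"][0]["message"]["content"])
```

Repeated requests with a shared prompt prefix are where the LMCache connector pays off, since the prefix KV cache is reused across requests.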
Status: Not validated with LMCache.
Status: Not supported. LMCache TRT-LLM integration is in progress.
## CacheBlend support
## Compression support

| Method | Status | Notes |
|---|---|---|
|  | Not validated |  |
## Caveats
None known.