Qwen3MoeForCausalLM#

Validated models#

Engine documentation: see the Qwen3 MoE entry in vLLM's supported models list (architecture Qwen3MoeForCausalLM).

Status: Validated with LMCache.

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

Qwen3-235B-A22B (4 GPUs, expert parallel):

vllm serve Qwen/Qwen3-235B-A22B \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser qwen3 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Qwen3-30B-A3B (1 GPU):

vllm serve Qwen/Qwen3-30B-A3B \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --reasoning-parser qwen3 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Qwen3-Coder-480B-A35B-Instruct-FP8 (8 GPUs, expert parallel):

vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Qwen3-Coder-30B-A3B-Instruct (1 GPU):

vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
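After any of the launches above, the deployment can be exercised through vLLM's OpenAI-compatible API. The sketch below sends a minimal chat request with the Python standard library only; it assumes vLLM's default port 8000 (adjust `base_url` if you pass `--port`):

```python
import json
import urllib.request

def chat(prompt: str, model: str, base_url: str = "http://localhost:8000/v1") -> str:
    """Send a minimal chat completion request to a running vLLM server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires one of the servers above to be running):
# print(chat("Hello", "Qwen/Qwen3-30B-A3B"))
```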

Adjust --tensor-parallel-size to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start.
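The value passed to --kv-transfer-config must be valid JSON. A quick way to sanity-check it before launching (plain Python, no LMCache dependency; the comments reflect my reading of the two fields, not official definitions):

```python
import json

# The connector config passed to vLLM via --kv-transfer-config.
kv_transfer_config = {
    "kv_connector": "LMCacheMPConnector",  # route KV cache through the LMCache MP server
    "kv_role": "kv_both",                  # this instance both saves and loads KV cache
}

# Serialize exactly as it appears on the command line (no extra whitespace).
flag_value = json.dumps(kv_transfer_config, separators=(",", ":"))
print(flag_value)
# → {"kv_connector":"LMCacheMPConnector","kv_role":"kv_both"}
```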

If you run into issues with the vLLM setup, see the vLLM Recipes for more details.

Status: Not validated with LMCache.

Status: Not supported. LMCache TRT-LLM integration is in progress.

CacheBlend support#

Compression support#

Method | Status | Notes
--- | --- | ---
CacheGen | Not validated |

Caveats#

None known.