MiniMax M3#
Validated models#
Engine documentation: MiniMax-M3 in vLLM supported models (architecture MiniMaxM3SparseForConditionalGeneration).
Status: Validated with LMCache.
Start the LMCache MP server:
lmcache server --l1-size-gb 100 --eviction-policy LRU
Start vLLM with the LMCache MP connector (8 GPUs):
vllm serve MiniMaxAI/MiniMax-M3 \
--tensor-parallel-size 8 \
--trust-remote-code \
--block-size 128 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
--block-size 128 is required for this model (see Caveats); smaller values, including vLLM’s default, fail vLLM’s KV-cache initialization. --trust-remote-code loads
M3’s custom architecture. Adjust --tensor-parallel-size to your
hardware; M3’s weights need eight 140 GB-class GPUs. For the generic
LMCache + vLLM wiring (ports, remote hosts), see
Quickstart.
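When the server is launched from a script rather than a shell, the nested quoting of the --kv-transfer-config JSON is easy to get wrong. The following is a minimal sketch that assembles the same command as an argument list; the helper name build_vllm_serve_argv is hypothetical, and the flags and values are only those shown in the command above.

```python
import json
import shlex


def build_vllm_serve_argv(tensor_parallel_size=8, block_size=128):
    """Assemble the `vllm serve` argv from this page (hypothetical helper)."""
    if block_size != 128:
        # Per the Caveats section, M3 requires a block size of 128.
        raise ValueError("MiniMax M3 requires --block-size 128")
    kv_transfer_config = {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
    }
    return [
        "vllm", "serve", "MiniMaxAI/MiniMax-M3",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--block-size", str(block_size),
        # json.dumps produces valid JSON, so no manual quoting is needed.
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


argv = build_vllm_serve_argv()
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) to start the server without going through a shell.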
If you run into issues with the vLLM setup, see the vLLM Recipes for more details.
Status: Not validated with LMCache.
CacheBlend support#
Compression support#
| Method | Status | Notes |
|---|---|---|
| | Not validated | |
Caveats#
Sparse attention with a lightning indexer. M3 runs grouped-query full attention plus a DeepSeek-style sparse-attention indexer. Each sparse layer owns two paged caches, the main K/V (rank-5) and a key-only indexer cache (rank-3), which vLLM places in a single UniformTypeKVCacheSpecs engine group. LMCache detects both layouts and stores/retrieves each as its own group; the indexer keys travel with the K/V because they cannot be recomputed from the cached K/V on a hit.
--block-size 128 is required. M3’s indexer uses sparse_block_size = 128; vLLM cannot reconcile the default block size (16) or 64 across the full-attention and sparse kernels and aborts KV-cache init with "No common block size". Use 128.
LMCache chunk size must be a multiple of the block size. The default chunk size (256) is already a multiple of 128, so no extra flag is needed; if you pass --chunk-size to the server, keep it a multiple of 128.
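The chunk-size rule above is plain divisibility, and it can be checked before starting the server. The following is a minimal sketch; the function and constant names are hypothetical, and the values 128 and 256 are taken from the caveats above rather than read from LMCache or vLLM.

```python
SPARSE_BLOCK_SIZE = 128   # M3 indexer sparse_block_size, as stated above
DEFAULT_CHUNK_SIZE = 256  # LMCache default chunk size, as stated above


def blocks_per_chunk(chunk_size=DEFAULT_CHUNK_SIZE, block_size=SPARSE_BLOCK_SIZE):
    """Return the number of vLLM blocks per LMCache chunk, or raise."""
    if chunk_size <= 0 or chunk_size % block_size != 0:
        raise ValueError(
            f"chunk size {chunk_size} is not a positive multiple of "
            f"block size {block_size}"
        )
    return chunk_size // block_size


print(blocks_per_chunk())     # 2
print(blocks_per_chunk(512))  # 4
```

A value such as 192 raises ValueError here, because it is not a multiple of 128.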