MiniMax M3

Validated models

MiniMaxAI/MiniMax-M3

vLLM

Engine documentation: MiniMax-M3 in vLLM supported models (architecture MiniMaxM3SparseForConditionalGeneration).

Status: Validated with LMCache.

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

Start vLLM with the LMCache MP connector (8 GPUs):

vllm serve MiniMaxAI/MiniMax-M3 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --block-size 128 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

--block-size 128 is required for this model (see Caveats); with smaller values such as the default (16) or 64, vLLM’s KV-cache initialization fails. --trust-remote-code loads M3’s custom architecture. Set --tensor-parallel-size to match your hardware; the command above assumes the eight 140 GB-class GPUs that M3’s weights need. For the generic LMCache + vLLM wiring (ports, remote hosts), see Quickstart.
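Once both processes are running, a single completion request is a quick way to confirm that the server answers. The sketch below assumes vLLM is listening on its default OpenAI-compatible endpoint at localhost:8000. The helper names build_payload and send_completion and the VLLM_URL variable are illustrative and are not part of vLLM or LMCache.

```shell
#!/usr/bin/env bash
# Assumption: vLLM serves its OpenAI-compatible API on the default port 8000.
VLLM_URL="${VLLM_URL:-http://localhost:8000}"

# Build a JSON body for /v1/completions.
# The prompt is inserted verbatim, so it must not contain double quotes
# or backslashes (this sketch does no JSON escaping).
build_payload() {
  local prompt="$1" max_tokens="${2:-32}"
  printf '{"model": "MiniMaxAI/MiniMax-M3", "prompt": "%s", "max_tokens": %d}' \
    "$prompt" "$max_tokens"
}

# POST the payload to the server and print the raw JSON response.
send_completion() {
  curl -sS "${VLLM_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}

# Usage (requires the vLLM server started above):
#   send_completion "Explain KV caching in one sentence." 64
```

The functions are only defined here, so sourcing the snippet does not contact the server until send_completion is called.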

If you run into issues with the vLLM setup, see the vLLM Recipes for more details.

SGLang

Status: Not validated with LMCache.

TRT-LLM

Status: Not validated with LMCache.

CacheBlend support

Compression support

Method	Status	Notes
CacheGen	Not validated

Caveats

Sparse attention with a lightning indexer. M3 runs grouped-query full attention plus a DeepSeek-style sparse-attention indexer. Each sparse layer owns two paged caches — the main K/V (rank-5) and a key-only indexer cache (rank-3) — which vLLM places in a single UniformTypeKVCacheSpecs engine group. LMCache detects both layouts and stores/retrieves each as its own group; the indexer keys travel with the K/V because they cannot be recomputed from the cached K/V on a hit.
--block-size 128 is required. M3’s indexer uses sparse_block_size = 128. With the default block size (16) or with 64, vLLM cannot find a block size common to the full-attention and sparse kernels, and it aborts KV-cache initialization with a "No common block size" error. Use 128.
LMCache chunk size must be a multiple of the block size. The default chunk size (256) already satisfies 128, so no extra flag is needed; if you pass --chunk-size to the server, keep it a multiple of 128.
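The chunk-size constraint above is plain modular arithmetic, so it can be checked before launching the server. The following is a minimal sketch; the function name chunk_size_ok and the BLOCK_SIZE variable are illustrative and are not part of LMCache or vLLM.

```shell
#!/usr/bin/env bash
BLOCK_SIZE=128   # the vLLM --block-size value required for M3

# Succeeds (exit status 0) when chunk_size is a positive multiple of block_size.
chunk_size_ok() {
  local chunk_size="$1" block_size="${2:-$BLOCK_SIZE}"
  [ "$chunk_size" -gt 0 ] && [ $((chunk_size % block_size)) -eq 0 ]
}

# Try a few candidate values for --chunk-size.
for size in 256 384 192; do
  if chunk_size_ok "$size"; then
    echo "chunk size $size: ok"
  else
    echo "chunk size $size: not a multiple of $BLOCK_SIZE"
  fi
done
```

Here 256 and 384 pass, while 192 is rejected because 192 mod 128 leaves a remainder of 64.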