DeepSeek-V4-Flash#

Validated models#

Engine documentation: DeepSeek-V4-Flash in vLLM supported models (architecture DeepseekV4ForCausalLM).

Status: Validated with LMCache.

Installing vLLM: DeepSeek-V4-Flash needs the sparse-MLA attention backends and the fp8_ds_mla KV cache kernels, so install vLLM by following its own recipe rather than a bare pip install vllm: vLLM DeepSeek-V4-Flash recipe (also mirrored at https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash).

Warning

Use the latest vLLM release, not the main/dev branch. The current vLLM development branch is broken for DeepSeek-V4-Flash (the fp4 MoE experts are misdispatched and the real weights fail to load). Pin to the latest tagged release as the vLLM recipe instructs.

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

Start vLLM with the LMCache MP connector (8 GPUs):

vllm serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --kv-cache-dtype fp8_ds_mla \
    --trust-remote-code \
    --tokenizer-mode deepseek_v4 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

--kv-cache-dtype fp8_ds_mla and --tokenizer-mode deepseek_v4 are required for this model; --enable-expert-parallel distributes the MoE experts across the tensor-parallel ranks. Adjust --tensor-parallel-size to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start.

If there are any issues with vLLM setup, please refer to the vLLM Recipes for more details.

Status: Not validated with LMCache.

Status: Not supported. LMCache TRT-LLM integration is in progress.

CacheBlend support#

Compression support#

Method

Status

Notes

CacheGen

Not validated

Caveats#

  • Requires the latest vLLM release. The vLLM dev branch is currently broken for this model (see the warning above) – use a tagged release installed via the vLLM recipe.

  • Sparse-MLA hybrid KV cache. DeepSeek-V4-Flash interleaves several KV cache groups with different block geometries (the compressed MLA latents are stored as fp8/uint8 while the sparse-attention indexer groups are float32), so the groups do not share a single block size. LMCache stores and retrieves each group in its own block size; no extra flags are required beyond --kv-cache-dtype fp8_ds_mla.