DeepSeek-V4-Flash
Validated models
Engine documentation:
DeepSeek-V4-Flash in vLLM supported models
(architecture DeepseekV4ForCausalLM).
Status: Validated with LMCache.
Installing vLLM: DeepSeek-V4-Flash needs the sparse-MLA attention
backends and the fp8_ds_mla KV cache kernels, so install vLLM by
following its own recipe rather than a bare pip install vllm:
vLLM DeepSeek-V4-Flash recipe
(also mirrored at https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash).
Warning
Use the latest vLLM release, not the main/dev branch. The
current vLLM development branch is broken for DeepSeek-V4-Flash (the
fp4 MoE experts are misdispatched and the real weights fail to
load). Pin to the latest tagged release as the vLLM recipe instructs.
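One way to confirm that the installed build is a tagged release is to inspect its version string. The sketch below assumes that development builds carry a PEP 440 `.dev` segment or a `+local` suffix; `is_release_version` and `installed_vllm_version` are hypothetical helper names, not part of vLLM.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds (exit 0) only for plain release versions
# such as "0.11.0". Assumption: dev builds look like "0.11.1.dev42+gabc123".
is_release_version() {
  case "$1" in
    ""|*dev*|*+*) return 1 ;;  # empty, dev segment, or local build suffix
    *) return 0 ;;
  esac
}

# Print the installed vLLM version; prints nothing if vLLM is not importable.
installed_vllm_version() {
  python3 -c 'import vllm; print(vllm.__version__)' 2>/dev/null
}
```

For example, `is_release_version "$(installed_vllm_version)" || echo "not a tagged vLLM release"` prints the message when the check fails. The check is only a heuristic on the version string and does not verify that the release is the latest one.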
Start the LMCache MP server:
lmcache server --l1-size-gb 100 --eviction-policy LRU
Start vLLM with the LMCache MP connector (8 GPUs):
vllm serve deepseek-ai/DeepSeek-V4-Flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--kv-cache-dtype fp8_ds_mla \
--trust-remote-code \
--tokenizer-mode deepseek_v4 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
--kv-cache-dtype fp8_ds_mla and --tokenizer-mode deepseek_v4 are
required for this model; --enable-expert-parallel distributes the MoE
experts across the tensor-parallel ranks. Adjust
--tensor-parallel-size to match your hardware. For the generic
LMCache + vLLM wiring (ports, remote hosts, in-process mode), see
Quick Start.
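As a rough sketch of adjusting --tensor-parallel-size to the hardware, the helpers below count the visible GPUs and emit the same arguments as the command above with that count substituted. `gpu_count` and `serve_args` are hypothetical names introduced here; the sketch assumes NVIDIA GPUs with `nvidia-smi` on the PATH and does not check whether the model fits in memory at a given size.

```shell
#!/usr/bin/env bash

# Count visible NVIDIA GPUs; prints 0 if nvidia-smi is unavailable.
gpu_count() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

# Print the vllm arguments from this page, one per line, with the
# tensor-parallel size taken from the first argument.
serve_args() {
  local tp="$1"
  printf '%s\n' \
    serve deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size "$tp" \
    --enable-expert-parallel \
    --kv-cache-dtype fp8_ds_mla \
    --trust-remote-code \
    --tokenizer-mode deepseek_v4 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
}

# Usage (bash 4+), left commented out so sourcing this file starts nothing:
#   mapfile -t args < <(serve_args "$(gpu_count)")
#   vllm "${args[@]}"
```

Emitting one argument per line keeps the JSON value intact as a single argument when it is read back into an array.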
If you run into problems setting up vLLM, refer to the vLLM Recipes for details.
Status: Not validated with LMCache.
Status: Not supported. LMCache TRT-LLM integration is in progress.
CacheBlend support
Compression support
| Method | Status | Notes |
|---|---|---|
Not validated |
Caveats
Requires the latest vLLM release. The vLLM dev branch is currently broken for this model (see the warning above) – use a tagged release installed via the vLLM recipe.
Sparse-MLA hybrid KV cache. DeepSeek-V4-Flash interleaves several KV cache groups with different block geometries (the compressed MLA latents are stored as
fp8/uint8 while the sparse-attention indexer groups are float32), so the groups do not share a single block size. LMCache stores and retrieves each group in its own block size; no extra flags are required beyond --kv-cache-dtype fp8_ds_mla.
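To illustrate why the groups cannot share one block size, the arithmetic below computes the bytes in one block for two groups. The geometries are made-up placeholders chosen for illustration, not the real DeepSeek-V4-Flash shapes, and `block_bytes` is a hypothetical helper; only the element widths (1 byte for fp8/uint8, 4 bytes for float32) come from the description above.

```shell
#!/usr/bin/env bash

# Bytes in one KV block:
#   tokens per block x elements per token x bytes per element
block_bytes() {
  echo $(( $1 * $2 * $3 ))
}

# Hypothetical geometries, for illustration only.
mla_block=$(block_bytes 64 512 1)      # 1-byte fp8/uint8 latents
indexer_block=$(block_bytes 64 64 4)   # 4-byte float32 indexer entries

echo "mla=${mla_block} indexer=${indexer_block}"
# prints: mla=32768 indexer=16384
```

With the same number of tokens per block, the two groups still occupy different numbers of bytes, which is the situation a per-group block size handles.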