Uniform Attention Models#
Recipes for standard (uniform-attention) transformer architectures that have been validated end-to-end with LMCache. Each architecture has one recipe page, which covers only the LMCache-specific configuration that diverges from the defaults.
These models use a single attention type across all layers, so vLLM serves them with one KV cache group. Models that interleave multiple attention types (sliding-window + full, or Mamba / linear-attention + full) are covered under Hybrid Attention Models.
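The distinction above can be sketched in a few lines of Python. This is an illustration only, not vLLM's grouping logic: it assumes a per-layer list of attention-type labels is available, and the function name `attention_layout` and the label strings are hypothetical. It only checks whether every layer shares one attention type, which is the property that makes a model "uniform" in the sense used on this page.

```python
from collections import Counter


def attention_layout(layer_types):
    """Classify a model as uniform or hybrid from per-layer attention labels.

    `layer_types` is a hypothetical list with one label per layer,
    e.g. ["full_attention", "full_attention", ...].
    """
    if not layer_types:
        raise ValueError("layer_types must not be empty")
    counts = Counter(layer_types)
    return {
        # Uniform means a single attention type is used by every layer.
        "kind": "uniform" if len(counts) == 1 else "hybrid",
        "distinct_types": len(counts),
        "layers_per_type": dict(counts),
    }


# Illustrative 4-layer models; the labels are assumptions, not real configs.
uniform = attention_layout(["full_attention"] * 4)
hybrid = attention_layout(
    ["sliding_attention", "sliding_attention", "sliding_attention", "full_attention"]
)
print(uniform["kind"], uniform["distinct_types"])
print(hybrid["kind"], hybrid["distinct_types"])
```

A model classified as hybrid by a check like this belongs under Hybrid Attention Models rather than on this page.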
Recipe page contents#
Each recipe page is intentionally minimal:
Validated models – exact HF repo IDs that have been tested.
Engine tabs – one tab per serving engine (vLLM, SGLang, TRT-LLM). Each tab links to the engine’s own documentation for the model and shows the exact lmcache server and engine launch commands. Tabs for engines that are not yet validated state so explicitly.
CacheBlend support – validation status (may be empty).
Compression support – table of compression methods (CacheGen, etc.) with per-method validation status. Extensible: new methods get a row.
Caveats – known limitations, if any.
For the generic LMCache + engine wiring (ports, remote hosts, sending a first request), see Quickstart. Recipes assume that page as a prerequisite.
Supported architectures#
| Architecture | Example HF model | vLLM | SGLang | TRT-LLM | Recipe |
|---|---|---|---|---|---|
|  |  | ✓ | — | — |  |
|  |  | ✓ | — | — |  |
|  |  | ✓ | — | — |  |
|  |  | ✓ | — | — |  |
|  |  | ✓ | — | — |  |
|  |  | ✓ | — | — |  |
Legend: ✓ validated, — not validated.
Contributing a recipe#
To add a new uniform-attention architecture:
Copy an existing page (e.g. minimax_m2.rst) to recipes/<architecture_snake_case>.rst.
Fill in Validated models, Engines, LMCache configuration, and Caveats. Keep each section terse – if a field has nothing to say, say so in one line rather than padding it.
Add a row to the table above and an entry to the hidden toctree below.
(For models that interleave attention types, add the page under Hybrid Attention Models instead.)
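As a rough aid for the first step, the target file name can be derived from an architecture's display name. The sketch below is an assumption inferred from the minimax_m2.rst example: the helper name `recipe_path`, the slug rule (lowercase, runs of non-alphanumeric characters collapsed to one underscore), and the display name "MiniMax M2" are all hypothetical, so check the result against the existing pages before relying on it.

```python
import re
from pathlib import PurePosixPath


def recipe_path(architecture: str) -> PurePosixPath:
    """Return recipes/<architecture_snake_case>.rst for a display name.

    The slug rule is an assumption: lowercase the name and collapse each
    run of non-alphanumeric characters into a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", architecture.lower()).strip("_")
    if not slug:
        raise ValueError("architecture name has no alphanumeric characters")
    return PurePosixPath("recipes") / f"{slug}.rst"


print(recipe_path("MiniMax M2"))  # recipes/minimax_m2.rst
```

The table row and the toctree entry still have to be added by hand; this helper only suggests a file name.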