Gemma3ForConditionalGeneration
Validated models
Engine documentation:
Gemma 3 in vLLM supported models
(architecture Gemma3ForConditionalGeneration).
Status: Validated with LMCache.
Start the LMCache MP server:
lmcache server --l1-size-gb 100 --eviction-policy LRU
Start vLLM with the LMCache MP connector:
vllm serve google/gemma-3-4b-it \
--tensor-parallel-size 1 \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
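If the launch command is generated from a script rather than typed by hand, the JSON passed to --kv-transfer-config is easy to misquote. The following is a minimal sketch, assuming Python is used to assemble the same command shown above. The helper name build_vllm_command is hypothetical; only the flags and values already on this page are used.

```python
import json
import shlex


def build_vllm_command(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Assemble the `vllm serve` argv from this page (hypothetical helper)."""
    # Same connector settings as the command on this page.
    kv_transfer_config = {
        "kv_connector": "LMCacheMPConnector",
        "kv_role": "kv_both",
    }
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # json.dumps guarantees the value is well-formed JSON.
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


command = build_vllm_command("google/gemma-3-4b-it")
# shlex.join adds the shell quoting, so the printed line can be pasted as-is.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run directly, which avoids shell quoting altogether.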
Gemma 3 interleaves local (sliding-window) and global (full) attention
layers, so vLLM keeps its hybrid KV cache manager enabled and exposes
multiple KV cache groups. LMCache stores and retrieves all of them through
its hybrid memory allocator (HMA) support. Because LMCacheMPConnector
advertises SupportsHMA, vLLM does not auto-disable the hybrid manager, and
no extra configuration is required.
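To make "multiple KV cache groups" more concrete, the sketch below partitions layer indices by attention type. It is an illustration only, not the grouping logic of vLLM or LMCache, which may split layers differently. The layer count (34) and the pattern of one global layer after every five local layers are assumptions about the Gemma 3 configuration and should be checked against the model's config.json. group_layers_by_attention is a hypothetical helper.

```python
from collections import defaultdict


def group_layers_by_attention(num_layers: int, pattern: int) -> dict[str, list[int]]:
    """Partition layer indices into sliding-window and full-attention groups.

    Assumes every `pattern`-th layer is global (full attention) and all
    other layers are local (sliding window).
    """
    groups: dict[str, list[int]] = defaultdict(list)
    for layer_idx in range(num_layers):
        kind = "full" if (layer_idx + 1) % pattern == 0 else "sliding_window"
        groups[kind].append(layer_idx)
    return dict(groups)


# Assumed values for illustration: 34 layers, every 6th layer global.
groups = group_layers_by_attention(num_layers=34, pattern=6)
for kind, layers in groups.items():
    print(f"{kind}: {len(layers)} layers, first few = {layers[:3]}")
```

Each group has a different cache shape and lifetime (a bounded window versus the full sequence), which is why a connector has to handle every group rather than a single uniform cache.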
google/gemma-3-4b-it is a gated model; authenticate with the Hugging
Face Hub (e.g. set HF_TOKEN) before serving. Adjust
--tensor-parallel-size to match your hardware. For the generic LMCache
+ vLLM wiring (ports, remote hosts, in-process mode), see
Quick Start.
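Because the model is gated, a missing token typically surfaces only once the server tries to download weights. A small preflight check can catch that earlier. This is a sketch under the assumption that authentication is done through the HF_TOKEN environment variable mentioned above; has_hf_token and preflight_message are hypothetical helpers, and a token stored by a prior Hugging Face Hub login would not be detected by this check.

```python
import os
from typing import Mapping


def has_hf_token(env: Mapping[str, str]) -> bool:
    """Return True when HF_TOKEN is present and not blank."""
    return bool(env.get("HF_TOKEN", "").strip())


def preflight_message(env: Mapping[str, str]) -> str:
    """Describe whether the environment looks ready for a gated model."""
    if has_hf_token(env):
        return "HF_TOKEN found; ready to run `vllm serve`."
    return "HF_TOKEN is not set; google/gemma-3-4b-it is a gated model."


# Check the current process environment before launching the server.
print(preflight_message(os.environ))
```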
If you run into issues with the vLLM setup, see the vLLM Recipes for more details.
Status: Not validated with LMCache.
Status: Not supported. LMCache TRT-LLM integration is in progress.
CacheBlend support
Not validated.
Compression support
| Method | Status | Notes |
|---|---|---|
| Not validated | | |
Caveats
Gated model. google/gemma-3-4b-it requires accepting the license on Hugging Face and authenticating (e.g. HF_TOKEN) before it can be served.
Hybrid attention. Gemma 3 is a hybrid (sliding-window + full-attention) model. LMCache transfers every KV cache group via its hybrid memory allocator support, so caching works transparently. This applies to the standard paged attention used by Gemma 3; Mamba / linear-attention hybrids (whose recurrent state caches LMCache cannot yet transfer) are not supported.