Gemma3ForConditionalGeneration#

Validated models#

Engine documentation: Gemma 3 in vLLM supported models (architecture Gemma3ForConditionalGeneration).

Status: Validated with LMCache.

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

Start vLLM with the LMCache MP connector:

vllm serve google/gemma-3-4b-it \
    --tensor-parallel-size 1 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Gemma 3 interleaves local (sliding-window) and global (full) attention layers, so vLLM keeps its hybrid KV cache manager on and exposes multiple KV cache groups. LMCache stores and retrieves all of them through its hybrid memory allocator support – LMCacheMPConnector advertises SupportsHMA, so vLLM does not auto-disable the hybrid manager and no extra configuration is required.

google/gemma-3-4b-it is a gated model; authenticate with the Hugging Face Hub (e.g. set HF_TOKEN) before serving. Adjust --tensor-parallel-size to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start.

If there are any issues with vLLM setup, please refer to the vLLM Recipes for more details.

Status: Not validated with LMCache.

Status: Not supported. LMCache TRT-LLM integration is in progress.

CacheBlend support#

Not validated.

Compression support#

Method

Status

Notes

CacheGen

Not validated

Caveats#

  • Gated model. google/gemma-3-4b-it requires accepting the license on Hugging Face and authenticating (e.g. HF_TOKEN) before it can be served.

  • Hybrid attention. Gemma 3 is a hybrid (sliding-window + full-attention) model. LMCache transfers every KV cache group via its hybrid memory allocator support, so caching works transparently. This applies to the standard paged attention used by Gemma 3; Mamba / linear-attention hybrids (whose recurrent state caches LMCache cannot yet transfer) are not supported.