Gemma 3#

Validated models#

google/gemma-3-4b-it

vLLM

Engine documentation: Gemma 3 in vLLM supported models (architecture Gemma3ForConditionalGeneration).

Status: Validated with LMCache.

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

Start vLLM with the LMCache MP connector:

vllm serve google/gemma-3-4b-it \
    --tensor-parallel-size 1 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Gemma 3 interleaves local (sliding-window) and global (full) attention layers, so vLLM keeps its hybrid KV cache manager on and exposes multiple KV cache groups. LMCache stores and retrieves all of them through its hybrid memory allocator support – LMCacheMPConnector advertises SupportsHMA, so vLLM does not auto-disable the hybrid manager and no extra configuration is required.

google/gemma-3-4b-it is a gated model; authenticate with the Hugging Face Hub (e.g. set HF_TOKEN) before serving. Adjust --tensor-parallel-size to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts), see Quickstart.

If there are any issues with vLLM setup, please refer to the vLLM Recipes for more details.

SGLang

Status: Not validated with LMCache.

TRT-LLM

Status: Supported. See Quickstart for TRT-LLM + LMCache setup.

CacheBlend support#

Not validated.

Compression support#

Method	Status	Notes
CacheGen	Not validated

Caveats#

Gated model. google/gemma-3-4b-it requires accepting the license on Hugging Face and authenticating (e.g. HF_TOKEN) before it can be served.
Hybrid attention. Gemma 3 is a hybrid (sliding-window + full-attention) model. LMCache transfers every KV cache group via its hybrid memory allocator support, so caching works transparently. This applies to the standard paged attention used by Gemma 3; Mamba / linear-attention hybrids (whose recurrent state caches LMCache cannot yet transfer) are not supported.