LlamaForCausalLM

Validated models

Engine documentation: LlamaForCausalLM in vLLM supported models (architecture LlamaForCausalLM).

Status: Validated with LMCache.

Request access on the model card page, then export your Hugging Face token as an environment variable:

export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxx
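
A missing token typically only surfaces when the gated weights are downloaded, so a preflight check can save a failed launch. The following is a minimal sketch; `check_hf_token` is a hypothetical helper, not part of vLLM, LMCache, or the Hugging Face tooling, and the `hf_` prefix check is only a warning based on the token format shown above.

```shell
# Hypothetical preflight helper (an illustration, not an official tool).
# Returns 1 if HUGGING_FACE_HUB_TOKEN is unset or empty, 0 otherwise.
check_hf_token() {
    if [ -z "${HUGGING_FACE_HUB_TOKEN:-}" ]; then
        echo "HUGGING_FACE_HUB_TOKEN is not set" >&2
        return 1
    fi
    case "$HUGGING_FACE_HUB_TOKEN" in
        hf_*) ;;  # matches the token format shown above
        *) echo "warning: token does not start with hf_" >&2 ;;
    esac
    return 0
}

# Example: abort a launch script early if the token is missing.
# check_hf_token || exit 1
```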

Start the LMCache MP server:

lmcache server --l1-size-gb 100 --eviction-policy LRU

The Instruct commands below pass a tool-calling chat template via --chat-template; obtain it by following the Llama tool calling guide from vLLM.

Start vLLM with the LMCache MP connector:

Meta-Llama-3.1-8B (1 GPU):

vllm serve meta-llama/Meta-Llama-3.1-8B \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Meta-Llama-3.1-8B-Instruct (1 GPU):

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template <path_to_llama3.1_json_template> \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Meta-Llama-3.1-70B (4 GPUs):

vllm serve meta-llama/Meta-Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

Meta-Llama-3.1-70B-Instruct (4 GPUs):

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --chat-template <path_to_llama3.1_json_template> \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
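
Once one of the servers above is running, a quick request can confirm that it responds. The sketch below assumes vLLM's OpenAI-compatible server is listening on its default address (localhost:8000). `completion_payload`, `send_completion`, and the `VLLM_URL` variable are hypothetical names introduced for illustration, and the payload builder does no JSON escaping, so the prompt must not contain quotes or backslashes.

```shell
# Hypothetical smoke-test helpers; not part of vLLM or LMCache.

# Build a minimal /v1/completions request body.
# Assumes model name and prompt contain no quotes or backslashes.
completion_payload() {
    printf '{"model":"%s","prompt":"%s","max_tokens":%d}' "$1" "$2" "$3"
}

# POST the request; the base URL defaults to localhost:8000 and can be
# overridden with the (hypothetical) VLLM_URL variable.
send_completion() {
    curl -sS "${VLLM_URL:-http://localhost:8000}/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2" "${3:-16}")"
}

# Example (requires a running server):
# send_completion meta-llama/Meta-Llama-3.1-8B "The capital of France is"
```

Sending the same long prompt twice is a simple way to exercise KV cache reuse, although confirming an actual cache hit means checking the server logs.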

Adjust --tensor-parallel-size to match your hardware. For the generic LMCache + vLLM wiring (ports, remote hosts, in-process mode), see Quick Start.
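
One way to choose the value is to take the largest power of two that fits the available GPU count. This is a heuristic sketch, assuming the tensor-parallel size should evenly divide the model's attention head count, which power-of-two sizes are expected to do for these Llama 3.1 checkpoints. `pick_tp_size` is a hypothetical helper, not a vLLM or LMCache command.

```shell
# Hypothetical helper: largest power of two <= the given GPU count.
# Returns 1 for counts below 2.
pick_tp_size() {
    gpus=$1
    tp=1
    while [ $((tp * 2)) -le "$gpus" ]; do
        tp=$((tp * 2))
    done
    echo "$tp"
}

# Example (requires nvidia-smi on the PATH):
# TP=$(pick_tp_size "$(nvidia-smi --list-gpus | wc -l)")
# vllm serve meta-llama/Meta-Llama-3.1-70B --tensor-parallel-size "$TP" ...
```

The helper considers only the GPU count. Whether the model fits in memory at that size still has to be checked against your hardware.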

Status: Not validated with LMCache.

Status: Not supported. LMCache TRT-LLM integration is in progress.

CacheBlend support

Compression support

Method | Status | Notes
CacheGen | Not validated |

Caveats

None known.