LlamaForCausalLM#
Validated models#
Engine documentation: LlamaForCausalLM in vLLM supported models (architecture LlamaForCausalLM).
Status: Validated with LMCache.
Apply for access on the model card page and add your huggingface token as an environment variable:
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxx
Start the LMCache MP server:
lmcache server --l1-size-gb 100 --eviction-policy LRU
Get the chat templates for tool calling by following the Llama tool calling guide from vLLM.
Start vLLM with the LMCache MP connector:
Meta-Llama-3.1-8B (1 GPU):
vllm serve meta-llama/Meta-Llama-3.1-8B \
--trust-remote-code \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
Meta-Llama-3.1-8B-Instruct (1 GPU):
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template <path_to_llama3.1_json_template> \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
Meta-Llama-3.1-70B (4 GPUs):
vllm serve meta-llama/Meta-Llama-3.1-70B \
--tensor-parallel-size 4 \
--trust-remote-code \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
Meta-Llama-3.1-70B-Instruct (4 GPUs):
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template <path_to_llama3.1_json_template> \
--kv-transfer-config \
'{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
Adjust --tensor-parallel-size to match your hardware. For the
generic LMCache + vLLM wiring (ports, remote hosts, in-process mode),
see Quick Start.
Status: Not validated with LMCache.
Status: Not supported. LMCache TRT-LLM integration is in progress.
CacheBlend support#
Compression support#
Method |
Status |
Notes |
|---|---|---|
Not validated |
Caveats#
None known.