Docker Installation#
LMCache offers an official Docker image for deployment. The image is available on Docker Hub at lmcache/lmcache_vllm.
Note
Make sure you have Docker installed on your machine. You can install Docker from here.
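Because the run command below uses the NVIDIA container runtime, it is also worth checking that Docker and the NVIDIA Container Toolkit can see your GPUs. A minimal sanity check, assuming the toolkit is installed (the CUDA base image tag is only an example):
# Check the Docker installation
docker --version
# Check that containers can access the GPUs (requires the NVIDIA Container Toolkit);
# the CUDA image tag below is only an example
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi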
Pulling the Docker Image#
To get started, pull the official Docker image with the following command:
docker pull lmcache/lmcache_vllm:lmcache-0.1.4
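You can optionally confirm that the image was pulled and is available locally:
docker image ls lmcache/lmcache_vllm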
Running the Docker Container#
To run the Docker container with your specified model, follow these steps:
Define the Model:
# define the model here
export model=meta-llama/Llama-3.2-1B
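If the model is gated on Hugging Face, you will also need an access token for the container. As a sketch, you can export it alongside the model (HF_TOKEN is just a hypothetical local variable name) and substitute it into the run command later:
# Hypothetical local variable holding your Hugging Face access token
export HF_TOKEN=<Your Huggingface Token>
# It can then be passed to the container as:
#   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"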
Create Configuration and Chat Template Files:
Save the following YAML code to a file, such as example.yaml, in the LMCache repository:
# Size (in tokens) of each KV cache chunk
chunk_size: 256
# Device backing the local cache
local_device: "cpu"
# Whether retrieve() is pipelined or not
pipelined_backend: False
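As a sketch, assuming for illustration that <Path to LMCache> in the run command below is ~/lmcache, the config file can be created like this:
mkdir -p ~/lmcache
cat > ~/lmcache/example.yaml <<'EOF'
# Size (in tokens) of each KV cache chunk
chunk_size: 256
# Device backing the local cache
local_device: "cpu"
# Whether retrieve() is pipelined or not
pipelined_backend: False
EOF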
Note
Some models require a chat template. This typically applies to non-instruct models; instruct models such as llama-3.1-70b-instruct do not need one. If needed, save the chat template below to a file, such as chat-template.txt, in the LMCache repository:
{%- if messages[0]['role'] == 'system' -%}
{%- set system_message = messages[0]['content'] -%}
{%- set messages = messages[1:] -%}
{%- else -%}
{%- set system_message = '' -%}
{%- endif -%}
{{ bos_token + system_message }}
{%- for message in messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{%- endif -%}
{%- if message['role'] == 'user' -%}
{{ 'USER: ' + message['content'] + '\n' }}
{%- elif message['role'] == 'assistant' -%}
{{ 'ASSISTANT: ' + message['content'] + eos_token + '\n' }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{ 'ASSISTANT:' }}
{% endif %}
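Before running the container, it can help to double-check that the directory you plan to mount contains both files (still assuming ~/lmcache as <Path to LMCache>):
ls ~/lmcache
# expected output: chat-template.txt  example.yaml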
Run the Docker Command:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v <Path to LMCache>:/etc/lmcache \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<Your Huggingface Token>" \
--env "LMCACHE_CONFIG_FILE=/etc/lmcache/example.yaml" \
--env "VLLM_WORKER_MULTIPROC_METHOD=spawn" \
--ipc=host \
--network=host \
lmcache/lmcache_vllm:lmcache-0.1.4 \
$model --gpu-memory-utilization 0.7 --port 8000
Note
If using a model that requires a chat template, make sure to include the --chat_template flag in the command. If the chat template file is named chat-template.txt, add the following to the run command:
--chat_template /etc/lmcache/chat-template.txt
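Once the container is running, you can watch its logs until the server is ready and then probe the OpenAI-compatible API; the container ID placeholder below comes from docker ps:
# Follow the logs until the API server reports it is listening
docker logs -f <container-id>
# List the served models as a simple readiness check
curl -s http://127.0.0.1:8000/v1/models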
Testing the Docker Container#
To verify the setup, you can test it using the following curl command:
curl -X 'POST' \
'http://127.0.0.1:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.2-1B",
"messages": [
{"role": "system", "content": "You are a helpful AI coding assistant."},
{"role": "user", "content": "Write a segment tree implementation in python"}
],
"max_tokens": 150
}'
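The server returns a standard OpenAI-style JSON response. If you have jq installed, a smaller request like the sketch below extracts only the generated text:
curl -s -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.2-1B",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 20
  }' | jq -r '.choices[0].message.content'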
Building Docker from Source#
Note
This section is for users who want to build the Docker image from source. For instructions, please refer to the lmcache-vllm repository.