# Docker deployment
Prerequisites: Docker Engine 27.0+
See Installation for pulling images.
## Running the container
```shell
IMAGE=<IMAGE_NAME>:<TAG>
docker run --runtime nvidia --gpus all \
    --env "HF_TOKEN=<REPLACE_WITH_YOUR_HF_TOKEN>" \
    --env "LMCACHE_CHUNK_SIZE=256" \
    --env "LMCACHE_LOCAL_CPU=True" \
    --env "LMCACHE_MAX_LOCAL_CPU_SIZE=5" \
    --volume ~/.cache/huggingface:/root/.cache/huggingface \
    --network host \
    $IMAGE \
    meta-llama/Llama-3.1-8B-Instruct --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```
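The value passed to `--kv-transfer-config` is a JSON string. A minimal sketch (plain Python, no vLLM required) of building it programmatically, which avoids quoting mistakes when the config grows:

```python
import json

# Build the kv-transfer-config string passed to vLLM on the command line.
# "kv_both" means this instance both stores and retrieves KV cache through
# the connector, rather than acting only as a producer or only as a consumer.
config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both",
}
arg = json.dumps(config)
print(arg)
```

`json.dumps` guarantees the string round-trips through vLLM's JSON parser, which hand-written shell quoting does not.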
See the docker run example for more details.
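As an alternative to passing each `LMCACHE_*` variable on the command line, LMCache can read its settings from a YAML file referenced by the `LMCACHE_CONFIG_FILE` environment variable. A sketch mirroring the flags above, assuming LMCache's convention of lowercasing the env-var suffix for YAML keys:

```yaml
# lmcache_config.yaml — assumed equivalent of the LMCACHE_* env vars above
chunk_size: 256        # tokens per KV-cache chunk
local_cpu: true        # offload KV cache to CPU memory
max_local_cpu_size: 5  # CPU cache limit (GB)
```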
## ROCm (AMD)
The AMD Infinity Hub listing for vLLM provides a prebuilt, optimized image for the AMD Instinct™ MI300X. See *LLM inference performance validation on AMD Instinct MI300X* for full instructions.
Validated environment: the `rocm/vllm-dev:nightly_0624_rc2_0624_rc2_20250620` image on an MI300X with vLLM V1.
```shell
docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path_to_your_models>:/app/model \
    -e HF_HOME="/app/model" \
    --name lmcache_rocm \
    rocm/vllm-dev:nightly_0624_rc2_0624_rc2_20250620 \
    bash
```
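The command above drops you into a shell inside the container. From there, a launch analogous to the CUDA example might look like the following sketch, assuming LMCache is available in the validated image and the model resolves under `HF_HOME`; the model name and sizes are illustrative:

```shell
# Run inside the ROCm container started above.
export LMCACHE_CHUNK_SIZE=256        # tokens per KV-cache chunk
export LMCACHE_LOCAL_CPU=True        # offload KV cache to CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=5  # CPU cache limit (GB)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```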