Docker deployment

Prerequisites: Docker Engine 27.0+ and, for GPU passthrough, the NVIDIA Container Toolkit.

See Installation for pulling images.

Running the container

Set IMAGE to the image you pulled, then start the vLLM server with the LMCache connector enabled:

IMAGE=<IMAGE_NAME>:<TAG>
docker run --runtime nvidia --gpus all \
    --env "HF_TOKEN=<REPLACE_WITH_YOUR_HF_TOKEN>" \
    --env "LMCACHE_CHUNK_SIZE=256" \
    --env "LMCACHE_LOCAL_CPU=True" \
    --env "LMCACHE_MAX_LOCAL_CPU_SIZE=5" \
    --volume ~/.cache/huggingface:/root/.cache/huggingface \
    --network host \
    $IMAGE \
    meta-llama/Llama-3.1-8B-Instruct --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

The LMCACHE_* variables above configure a 256-token chunk size and enable CPU offloading with a 5 GB cap. See the docker run example for more details.

ROCm (AMD)

The AMD Infinity Hub for vLLM offers a prebuilt, optimized image for the AMD Instinct™ MI300X. See LLM inference performance validation on AMD Instinct MI300X for full instructions.

Validated environment: rocm/vllm-dev:nightly_0624_rc2_0624_rc2_20250620, MI300X, vLLM V1.

Start an interactive shell in the validated image:

docker run -it \
    --network=host \
    --group-add=video \
    --ipc=host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v <path_to_your_models>:/app/model \
    -e HF_HOME="/app/model" \
    --name lmcache_rocm \
    rocm/vllm-dev:nightly_0624_rc2_0624_rc2_20250620 \
    bash
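Inside the container you can then start vLLM with the LMCache connector, mirroring the NVIDIA command above. This is a sketch under stated assumptions: the LMCACHE_* values are illustrative, and the model must already be available under the /app/model directory you mounted as HF_HOME.

```shell
# Run inside the ROCm container started above.
# LMCACHE_* values are illustrative; HF_HOME=/app/model is set by the
# docker run command, so weights are resolved from the mounted volume.
export LMCACHE_CHUNK_SIZE=256
export LMCACHE_LOCAL_CPU=True
export LMCACHE_MAX_LOCAL_CPU_SIZE=5

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
```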