GDS Backend#

Overview#

This backend works with any file system: local, remote, or remote with GDS-based optimizations. Remote file systems allow multiple LMCache instances to share data seamlessly. The GDS (GPU-Direct Storage) optimizations enable zero-copy I/O between GPU memory and storage systems.

Ways to configure LMCache GDS Backend#

1. Environment Variables:

LMCACHE_USE_EXPERIMENTAL MUST be set directly as an environment variable.

# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Path to store files
export LMCACHE_GDS_PATH="/mnt/gds/cache"
# CuFile Buffer Size in MiB
export LMCACHE_CUFILE_BUFFER_SIZE="8192"
# Disabling CPU RAM offload is sometimes recommended as the
# CPU can get in the way of GPUDirect operations
export LMCACHE_LOCAL_CPU=False

2. Configuration File:

Passed in through LMCACHE_CONFIG_FILE=your-lmcache-config.yaml

LMCACHE_USE_EXPERIMENTAL MUST still be set directly as an environment variable; it cannot be set in the config file.

Example config.yaml:

# 256 Tokens per KV Chunk
chunk_size: 256
# Disable local CPU
local_cpu: false
# Path to file system, local, remote or GDS-enabled mount
gds_path: "/mnt/gds/cache"
# CuFile Buffer Size in MiB
cufile_buffer_size: 8192

CuFile Buffer Size Explanation#

The backend currently pre-registers buffer space to speed up cuFile operations. This buffer is registered in VRAM, so vLLM options such as --gpu-memory-utilization should be taken into account when setting it. For example, a good rule of thumb for an H100, which generally has 80 GiB of VRAM, is to start with an 8 GiB buffer and --gpu-memory-utilization 0.85, then fine-tune from there depending on your workload.
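
As a quick sanity check, you can query your total VRAM and do the arithmetic explicitly. The numbers in the comments below are illustrative for an 80 GiB H100:

# Print total VRAM per GPU
nvidia-smi --query-gpu=memory.total --format=csv,noheader
# 0.85 * 80 GiB ~= 68 GiB reserved by vLLM, leaving ~12 GiB of headroom
# for the 8 GiB cuFile buffer plus CUDA and runtime overhead.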

Setup Example#

Prerequisites:

  • A machine with at least one GPU. You can adjust the max model length of your vLLM instance depending on your GPU memory.

  • A mounted file system. A file system supporting GDS will work best.

  • vLLM and LMCache installed (Installation Guide)

  • Hugging Face access to meta-llama/Llama-3.1-70B-Instruct
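
Export your Hugging Face token so the model can be downloaded: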

export HF_TOKEN=your_hugging_face_token

Step 1. Create cache directory under your file system mount:

To find all the types of file systems supporting GDS on your system, use gdscheck from NVIDIA:

sudo /usr/local/cuda-*/gds/tools/gdscheck -p

Check with your storage vendor on how to mount the remote file system.

(For example, if you want to use a GDS-enabled NFS driver, try the modified [NFS stack](https://vastnfs.vastdata.com/), which is an open source driver that works with any standard [NFS RDMA](https://datatracker.ietf.org/doc/html/rfc5532) server. More vendor-specific instructions will be added here in the future).
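
As an illustration only (the exact options depend on your server and driver), an NFS-over-RDMA export is typically mounted like this, with server:/export as a placeholder:

sudo mount -t nfs -o rdma,port=20049 server:/export /mnt/gds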

Create a directory under the file system mount (the name here is arbitrary):

mkdir /mnt/gds/cache
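
You can confirm the directory sits on the intended mount and check the reported file system type with:

df -T /mnt/gds/cache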

Step 2. Start a vLLM server with the GDS backend enabled:

Create an LMCache configuration file called gds-backend.yaml:

local_cpu: false
chunk_size: 256
gds_path: "/mnt/gds/cache"
cufile_buffer_size: 8192

If you don’t want to use a config file, uncomment the first four environment variables and then comment out the LMCACHE_CONFIG_FILE line below:

# LMCACHE_LOCAL_CPU=False \
# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_GDS_PATH="/mnt/gds/cache" \
# LMCACHE_CUFILE_BUFFER_SIZE=8192 \
LMCACHE_CONFIG_FILE="gds-backend.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
    meta-llama/Llama-3.1-70B-Instruct \
    --max-model-len 65536 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'