GDS Backend#
Overview#
This backend works with any file system, whether local, remote, or remote with GDS-based optimizations. Remote file systems allow multiple LMCache instances to share data seamlessly. The GDS (GPUDirect Storage) optimizations enable zero-copy I/O between GPU memory and storage systems.
Ways to configure LMCache GDS Backend#
1. Environment Variables:
LMCACHE_USE_EXPERIMENTAL MUST be set directly as an environment variable.
# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Path to store files
export LMCACHE_GDS_PATH="/mnt/gds/cache"
# CuFile Buffer Size in MiB
export LMCACHE_CUFILE_BUFFER_SIZE="8192"
# Disabling CPU RAM offload is sometimes recommended as the
# CPU can get in the way of GPUDirect operations
export LMCACHE_LOCAL_CPU=False
2. Configuration File:
Passed in through LMCACHE_CONFIG_FILE=your-lmcache-config.yaml
LMCACHE_USE_EXPERIMENTAL MUST be set directly as an environment variable.
Example config.yaml:
# 256 Tokens per KV Chunk
chunk_size: 256
# Disable local CPU
local_cpu: false
# Path to file system, local, remote or GDS-enabled mount
gds_path: "/mnt/gds/cache"
# CuFile Buffer Size in MiB
cufile_buffer_size: 8192
CuFile Buffer Size Explanation#
The backend currently pre-registers buffer space to speed up cuFile operations. This buffer space is registered in VRAM, so options like vLLM's --gpu-memory-utilization should be taken into account when setting it. For example, a good rule of thumb for an H100, which generally has 80 GiB of VRAM, is to start with an 8 GiB buffer, set --gpu-memory-utilization 0.85, and fine-tune from there depending on your workload.
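As a quick sanity check on the arithmetic (illustrative numbers for an 80 GiB H100, not a tuned recommendation):
# VRAM budget sketch for an 80 GiB H100:
#   vLLM reservation: 0.85 * 80 GiB = 68 GiB   (--gpu-memory-utilization 0.85)
#   cuFile buffer:     8 GiB                   (LMCACHE_CUFILE_BUFFER_SIZE=8192, in MiB)
#   remaining:        80 - 68 - 8 = 4 GiB      (CUDA context and other overheads)
export LMCACHE_CUFILE_BUFFER_SIZE=8192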
Setup Example#
Prerequisites:
A machine with at least one GPU. You can adjust the max model length of your vLLM instance depending on your GPU memory.
A mounted file system. A file system supporting GDS will work best.
vllm and lmcache installed (Installation Guide)
Hugging Face access to meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_hugging_face_token
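To sanity-check these prerequisites, you can run something like the following (a quick sketch; it assumes both packages are installed in the active Python environment):
# Confirm at least one GPU is visible
nvidia-smi
# Confirm vllm and lmcache import cleanly
python -c "import vllm, lmcache; print(vllm.__version__)"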
Step 1. Create a cache directory under your file system mount:
To list the file systems on your system that support GDS, use NVIDIA's gdscheck:
sudo /usr/local/cuda-*/gds/tools/gdscheck -p
Check with your storage vendor on how to mount the remote file system.
(For example, if you want to use a GDS-enabled NFS driver, try the modified NFS stack (https://vastnfs.vastdata.com/), an open-source driver that works with any standard NFS RDMA (https://datatracker.ietf.org/doc/html/rfc5532) server. More vendor-specific instructions will be added here in the future.)
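As an illustration only (the server address and export path below are placeholders, and your vendor's mount instructions take precedence), an NFS-over-RDMA mount typically looks something like this:
# Mount an NFS export over RDMA (20049 is the conventional NFS/RDMA port)
sudo mkdir -p /mnt/gds
sudo mount -t nfs -o vers=3,proto=rdma,port=20049 nfs-server:/export /mnt/gds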
Create a directory under the file system mount (the name here is arbitrary):
mkdir /mnt/gds/cache
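Optionally, confirm the directory is writable before pointing LMCache at it (a plain write test; this does not exercise the GDS data path itself):
# Write and remove a small test file to confirm permissions
dd if=/dev/zero of=/mnt/gds/cache/.write_test bs=1M count=1
rm /mnt/gds/cache/.write_test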
Step 2. Start a vLLM server with the GDS backend enabled:
Create an LMCache configuration file called gds-backend.yaml:
local_cpu: false
chunk_size: 256
gds_path: "/mnt/gds/cache"
cufile_buffer_size: 8192
If you don’t want to use a config file, uncomment the first four environment variables below and comment out the LMCACHE_CONFIG_FILE line:
# LMCACHE_LOCAL_CPU=False \
# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_GDS_PATH="/mnt/gds/cache" \
# LMCACHE_CUFILE_BUFFER_SIZE=8192 \
LMCACHE_CONFIG_FILE="gds-backend.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
meta-llama/Llama-3.1-70B-Instruct \
--max-model-len 65536 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'