Weka

Overview

WekaFS is a high-performance, distributed filesystem and a supported option for KV cache offloading in LMCache. Although the generic local-filesystem backend also works with a WekaFS mount, this backend is optimized for Weka's characteristics: it leverages GPUDirect Storage (GDS) for I/O and allows data sharing between multiple LMCache instances.
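
Because this backend relies on GPUDirect Storage, it can help to verify that GDS is functional on the host before enabling offloading. One option is the gdscheck utility that ships with the CUDA GDS tools (the path below is the common default install location and may differ on your system):

# Print GPUDirect Storage platform support information
/usr/local/cuda/gds/tools/gdscheck -p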

Ways to Configure LMCache Weka Offloading

1. Environment Variables:

LMCACHE_USE_EXPERIMENTAL MUST be set directly as an environment variable.

# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Path to Weka Mount
export LMCACHE_WEKA_PATH="/mnt/weka/cache"
# CuFile Buffer Size in MiB
export LMCACHE_CUFILE_BUFFER_SIZE="8192"
# Disabling CPU RAM offload is sometimes recommended as the
# CPU can get in the way of GPUDirect operations
export LMCACHE_LOCAL_CPU=False

2. Configuration File:

Passed in through LMCACHE_CONFIG_FILE=your-lmcache-config.yaml

Even when using a config file, LMCACHE_USE_EXPERIMENTAL MUST still be set directly as an environment variable (a launch sketch follows the example config below).

Example config.yaml:

# 256 Tokens per KV Chunk
chunk_size: 256
# Disable local CPU
local_cpu: false
# Path to Weka Mount
weka_path: "/mnt/weka/cache"
# CuFile Buffer Size in MiB
cufile_buffer_size: 8192
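
For instance, the two mechanisms combine at launch roughly like this (a sketch; <your-model> is a placeholder, and the full command appears in the Setup Example below):

LMCACHE_USE_EXPERIMENTAL=True \
LMCACHE_CONFIG_FILE=your-lmcache-config.yaml \
vllm serve <your-model> \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'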

CuFile Buffer Size Explanation

The backend currently pre-registers buffer space to speed up cuFile operations. This buffer is registered in VRAM, so vLLM options like --gpu-memory-utilization should be taken into account when sizing it. For example, a good rule of thumb for an H100, which generally has 80 GiB of VRAM, is to start with an 8 GiB buffer and set --gpu-memory-utilization 0.85, then fine-tune from there depending on your workload.
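
As a back-of-the-envelope check (the variable names and headroom figure below are illustrative; only the 8 GiB buffer, the 80 GiB H100 figure, and the 0.85 utilization come from the paragraph above):

# Rough sizing sketch for an 80 GiB GPU with an 8 GiB cuFile buffer
TOTAL_VRAM_MIB=81920      # H100: 80 GiB
CUFILE_BUFFER_MIB=8192    # matches cufile_buffer_size above
HEADROOM_MIB=4096         # ~5% slack for CUDA context and fragmentation
USABLE_MIB=$(( TOTAL_VRAM_MIB - CUFILE_BUFFER_MIB - HEADROOM_MIB ))
# 69632 * 100 / 81920 = 85, i.e. --gpu-memory-utilization 0.85
echo "--gpu-memory-utilization 0.$(( USABLE_MIB * 100 / TOTAL_VRAM_MIB ))"

This is why the 8 GiB buffer pairs with --gpu-memory-utilization 0.85 on an 80 GiB card: together they leave a few GiB of headroom.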

Setup Example

Prerequisites:

  • A machine with at least one GPU. You can adjust the max model length of your vLLM instance depending on your GPU memory.

  • Weka already installed and mounted.

  • vLLM and LMCache installed (Installation Guide)

  • Hugging Face access to meta-llama/Llama-3.1-70B-Instruct

export HF_TOKEN=your_hugging_face_token
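
Optionally, confirm the token is valid (assumes the huggingface_hub CLI is installed; recent versions pick up HF_TOKEN automatically):

huggingface-cli whoami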

Step 1. Create cache directory under your Weka mount:

To find all your WekaFS mounts, run:

mount -t wekafs

For the sake of this example, let's say the above returns:

10.27.1.1/default on /mnt/weka type wekafs (rw,relatime,writecache,inode_bits=auto,readahead_kb=32768,dentry_max_age_positive=1000,dentry_max_age_negative=0,container_name=client)

Then create a directory under it (the name here is arbitrary):

mkdir /mnt/weka/cache
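
Optionally, confirm the new directory really lives on the Weka filesystem:

# The Type column should show wekafs
df -T /mnt/weka/cache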

Step 2. Start a vLLM server with Weka offloading enabled:

Create an LMCache configuration file called weka-offload.yaml:

local_cpu: false
chunk_size: 256
weka_path: "/mnt/weka/cache"
cufile_buffer_size: 8192

If you don't want to use a config file, uncomment the four commented environment variables below and comment out the LMCACHE_CONFIG_FILE line:

# LMCACHE_LOCAL_CPU=False \
# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_WEKA_PATH="/mnt/weka/cache" \
# LMCACHE_CUFILE_BUFFER_SIZE=8192 \
LMCACHE_CONFIG_FILE="weka-offload.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
    meta-llama/Llama-3.1-70B-Instruct \
    --max-model-len 65536 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'