Weka

Overview

WekaFS is a high-performance, distributed filesystem and a supported option for KV cache offloading in LMCache. Although the generic local-filesystem backend also works with a WekaFS mount, this backend is optimized for Weka's characteristics: it leverages GPUDirect Storage (GDS) for I/O and allows data sharing between multiple LMCache instances.
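
Because this backend relies on GPUDirect Storage, it can help to verify that GDS is functional on the host before enabling offloading. One option is the gdscheck utility that ships with the CUDA GDS tools (the path below is the common default install location and may differ on your system):

# Print GPUDirect Storage platform support information
/usr/local/cuda/gds/tools/gdscheck -p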

Ways to Configure LMCache Weka Offloading

1. Environment Variables:

LMCACHE_USE_EXPERIMENTAL MUST be set directly as an environment variable.

# Specify LMCache V1
export LMCACHE_USE_EXPERIMENTAL=True
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Path to Weka Mount
export LMCACHE_WEKA_PATH="/mnt/weka/cache"
# CuFile Buffer Size in MiB
export LMCACHE_CUFILE_BUFFER_SIZE="8192"
# Disabling CPU RAM offload is sometimes recommended as the
# CPU can get in the way of GPUDirect operations
export LMCACHE_LOCAL_CPU=False

2. Configuration File:

Passed in through LMCACHE_CONFIG_FILE=your-lmcache-config.yaml

Even when using a config file, LMCACHE_USE_EXPERIMENTAL MUST still be set directly as an environment variable (a launch sketch follows the example config below).

Example config.yaml:

# 256 Tokens per KV Chunk
chunk_size: 256
# Disable local CPU
local_cpu: false
# Path to Weka Mount
weka_path: "/mnt/weka/cache"
# CuFile Buffer Size in MiB
cufile_buffer_size: 8192
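
For instance, the two mechanisms combine at launch roughly like this (a sketch; <your-model> is a placeholder, and the full command appears in the Setup Example below):

LMCACHE_USE_EXPERIMENTAL=True \
LMCACHE_CONFIG_FILE=your-lmcache-config.yaml \
vllm serve <your-model> \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'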

CuFile Buffer Size Explanation

The backend currently pre-registers buffer space to speed up cuFile operations. This buffer is registered in VRAM, so vLLM options like --gpu-memory-utilization should be taken into account when sizing it. For example, a good rule of thumb for an H100, which generally has 80 GiB of VRAM, is to start with an 8 GiB buffer and set --gpu-memory-utilization 0.85, then fine-tune from there depending on your workload.
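
As a back-of-the-envelope check (the variable names and headroom figure below are illustrative; only the 8 GiB buffer, the 80 GiB H100 figure, and the 0.85 utilization come from the paragraph above):

# Rough sizing sketch for an 80 GiB GPU with an 8 GiB cuFile buffer
TOTAL_VRAM_MIB=81920      # H100: 80 GiB
CUFILE_BUFFER_MIB=8192    # matches cufile_buffer_size above
HEADROOM_MIB=4096         # ~5% slack for CUDA context and fragmentation
USABLE_MIB=$(( TOTAL_VRAM_MIB - CUFILE_BUFFER_MIB - HEADROOM_MIB ))
# 69632 * 100 / 81920 = 85, i.e. --gpu-memory-utilization 0.85
echo "--gpu-memory-utilization 0.$(( USABLE_MIB * 100 / TOTAL_VRAM_MIB ))"

This is why the 8 GiB buffer pairs with --gpu-memory-utilization 0.85 on an 80 GiB card: together they leave a few GiB of headroom.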

Setup Example

Prerequisites:

  • A machine with at least one GPU. You can adjust the max model length of your vLLM instance depending on your GPU memory.

  • Weka already installed and mounted.

  • vLLM and LMCache installed (Installation Guide)

  • Hugging Face access to meta-llama/Llama-3.1-70B-Instruct

export HF_TOKEN=your_hugging_face_token
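
Optionally, confirm the token is valid (assumes the huggingface_hub CLI is installed; recent versions pick up HF_TOKEN automatically):

huggingface-cli whoami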

Step 1. Create cache directory under your Weka mount:

To find all your WekaFS mounts, run:

mount -t wekafs

For the sake of this example, let's say the above returns:

10.27.1.1/default on /mnt/weka type wekafs (rw,relatime,writecache,inode_bits=auto,readahead_kb=32768,dentry_max_age_positive=1000,dentry_max_age_negative=0,container_name=client)

Then create a directory under it (the name here is arbitrary):

mkdir /mnt/weka/cache
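
Optionally, confirm the new directory really lives on the Weka filesystem:

# The Type column should show wekafs
df -T /mnt/weka/cache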

Step 2. Start a vLLM server with Weka offloading enabled:

Create an LMCache configuration file called weka-offload.yaml:

local_cpu: false
chunk_size: 256
weka_path: "/mnt/weka/cache"
cufile_buffer_size: 8192

If you don't want to use a config file, uncomment the four commented environment variables below and comment out the LMCACHE_CONFIG_FILE line:

# LMCACHE_LOCAL_CPU=False \
# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_WEKA_PATH="/mnt/weka/cache" \
# LMCACHE_CUFILE_BUFFER_SIZE=8192 \
LMCACHE_CONFIG_FILE="weka-offload.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
vllm serve \
    meta-llama/Llama-3.1-70B-Instruct \
    --max-model-len 65536 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'