Configuring LMCache#

There are two ways to configure LMCache:
  • Using a YAML configuration file

  • Using environment variables

Using a YAML configuration file#

The following is the list of configuration parameters that can be set for LMCache. The configuration is written as a YAML file.

# The size of the chunk as an integer
# (set to 256 by default)
chunk_size: int

# The local KV cache device to use (set to "cuda" by default)
# Possible values: "cpu", "cuda", "file://local_disk/"
local_device: Optional[str]

# The maximum size of the local KV cache as an integer (GB)
# Set to 5 by default
max_local_cache_size: int

# Remote URL for the storage backend (can be redis or redis-sentinel)
# Should have the format <scheme>://<host>:<port>
# E.g. redis://localhost:65432
# E.g. redis-sentinel://localhost:26379
remote_url: Optional[str]

# The serializer/deserializer ("serde") used for the remote backend
# Can be "cachegen", "torch", "safetensor", "fast"
remote_serde: Optional[str]

# Whether retrieve() is pipelined or not
# Set to False by default
pipelined_backend: bool

# Whether to save the decode cache
# Set to False by default
save_decode_cache: bool

# Whether to enable KV cache blending
# Set to False by default
enable_blending: bool

# The recompute ratio if KV cache blending is enabled
# Set to 0.5 by default
blend_recompute_ratio: float

# The minimum number of tokens for blending
# Set to 256 by default
blend_min_tokens: int

This configuration file can be saved as lmcache_config.yaml and passed to LMCache via the LMCACHE_CONFIG_FILE environment variable as follows:

$ LMCACHE_CONFIG_FILE=./lmcache_config.yaml lmcache_vllm serve <args>
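
For reference, a complete lmcache_config.yaml that keeps the local cache in CPU memory and uses a Redis remote backend could look like the sketch below; the device, URL, and sizes are illustrative choices taken from the examples on this page, not recommendations:

# Example lmcache_config.yaml (illustrative values)
chunk_size: 256
# Keep the local KV cache in CPU memory, capped at 5 GB
local_device: "cpu"
max_local_cache_size: 5
# Remote Redis backend, serialized with cachegen
remote_url: "redis://localhost:65432"
remote_serde: "cachegen"
pipelined_backend: False
save_decode_cache: False
# Blending disabled; blend_recompute_ratio and blend_min_tokens are left at their defaults
enable_blending: False
blend_recompute_ratio: 0.5
blend_min_tokens: 256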

Using environment variables#

The following is the list of environment variables that can be set for LMCache.

# The size of the chunk as an integer
# (set to 256 by default)
LM_CACHE_CHUNK_SIZE: int

# The local KV cache device to use (set to "cuda" by default)
# Possible values: "cpu", "cuda", "file://local_disk/"
LM_CACHE_LOCAL_DEVICE: Optional[str]

# The maximum size of the local KV cache as an integer (GB)
# Set to 5 by default
LM_CACHE_MAX_LOCAL_CACHE_SIZE: int

# Remote URL for the storage backend (can be redis or redis-sentinel)
# Should have the format <scheme>://<host>:<port>
# E.g. redis://localhost:65432
# E.g. redis-sentinel://localhost:26379
LM_CACHE_REMOTE_URL: Optional[str]

# The serializer/deserializer ("serde") used for the remote backend
# Can be "cachegen", "torch", "safetensor", "fast"
LM_CACHE_REMOTE_SERDE: Optional[str]

# Whether retrieve() is pipelined or not
# Set to False by default
LM_CACHE_PIPELINED_BACKEND: bool

# Whether to save the decode cache
# Set to False by default
LM_CACHE_SAVE_DECODE_CACHE: bool

# Whether to enable KV cache blending
# Set to False by default
LM_CACHE_ENABLE_BLENDING: bool

# The recompute ratio if KV cache blending is enabled
# Set to 0.5 by default
LM_CACHE_BLEND_RECOMPUTE_RATIO: float

# The minimum number of tokens for blending
# Set to 256 by default
LM_CACHE_BLEND_MIN_TOKENS: int

To run LMCache with these environment variables, you can do the following:

export LM_CACHE_CHUNK_SIZE=256
export LM_CACHE_LOCAL_DEVICE="cuda"
export LM_CACHE_MAX_LOCAL_CACHE_SIZE=5
export LM_CACHE_REMOTE_URL="redis://localhost:65432"
export LM_CACHE_REMOTE_SERDE="cachegen"
export LM_CACHE_PIPELINED_BACKEND=False
export LM_CACHE_SAVE_DECODE_CACHE=False
export LM_CACHE_ENABLE_BLENDING=False
export LM_CACHE_BLEND_RECOMPUTE_RATIO=0.5
export LM_CACHE_BLEND_MIN_TOKENS=256

lmcache_vllm serve <args>

You can wrap these lines in a script named run.sh and run it as follows (a sketch of such a script is shown after the commands):

$ chmod +x run.sh
$ ./run.sh
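
For reference, a minimal run.sh along these lines might look like the following sketch; the exported values simply repeat the example above, and <args> stands in for your own serve arguments:

#!/bin/bash
# run.sh - set the LMCache environment variables, then start the server
set -e

export LM_CACHE_CHUNK_SIZE=256
export LM_CACHE_LOCAL_DEVICE="cuda"
export LM_CACHE_MAX_LOCAL_CACHE_SIZE=5
export LM_CACHE_REMOTE_URL="redis://localhost:65432"
export LM_CACHE_REMOTE_SERDE="cachegen"
export LM_CACHE_PIPELINED_BACKEND=False
export LM_CACHE_SAVE_DECODE_CACHE=False
export LM_CACHE_ENABLE_BLENDING=False
export LM_CACHE_BLEND_RECOMPUTE_RATIO=0.5
export LM_CACHE_BLEND_MIN_TOKENS=256

lmcache_vllm serve <args>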