Supported Models
LMCache supports a wide variety of models; any model hosted on Huggingface can be used directly by its model card name.
Note
Only the following models are optimized for CacheGen-based compression/decompression. Other models can still run CacheGen, but may not be optimized for it.
To use vLLM's offline inference with LMCache for any model, pass the model card name exactly as it appears on Huggingface:
# This is for LMCache v0
from lmcache_vllm.vllm import LLM, SamplingParams

# Model card (Huggingface model card format name)
model_card = "insert here"

# Load the model
model = LLM(model=model_card)

# Use the model to generate up to 100 new tokens
outputs = model.generate(["Hello, my name is"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
Note
Many models on Huggingface are gated: you may need a Huggingface access token, which you can obtain after accepting the model's terms and conditions. To log in, add the following to the top of your Python script:
from huggingface_hub import login
login()
# You will now be prompted to enter your Huggingface access token.
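In non-interactive environments such as batch jobs or CI, an interactive prompt is impractical; you can instead pass the token to login() directly. A minimal sketch, assuming the token is stored in an HF_TOKEN environment variable (an illustrative convention, not a requirement):

# Non-interactive login sketch. HF_TOKEN is an assumed environment
# variable name; store the token wherever your environment keeps secrets.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])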
For more information on Huggingface login, please refer to the Huggingface documentation.