Quickstart

LMCache has the same interface as vLLM for both online serving and offline inference. For online serving, you can start an OpenAI API-compatible vLLM server with LMCache via:

$ lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
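
Once the server is up, it accepts standard OpenAI API requests. Below is a minimal sketch using the openai Python client; it assumes the server is listening on vLLM's default port 8000 and was started without an API key (the client still requires a placeholder key):

from openai import OpenAI

# Point the client at the local LMCache-enabled vLLM server
# (port 8000 is vLLM's default; adjust if you changed it).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

completion = client.completions.create(
    model="lmsys/longchat-7b-16k",
    prompt="Hello, my name is",
    max_tokens=100,
)
print(completion.choices[0].text)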

To use vLLM’s offline inference with LMCache, simply prefix the imports of vLLM components with lmcache_vllm. For example:

import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM, SamplingParams

# Load the model
model = LLM(model="lmsys/longchat-7b-16k")

# Use the model: generate up to 100 new tokens for the prompt
sampling_params = SamplingParams(max_tokens=100)
outputs = model.generate("Hello, my name is", sampling_params)
print(outputs[0].outputs[0].text)