Quickstart
LMCache v1
For LMCache v1, LMCache is loaded into vLLM as a KV connector, so you can start an LMCache-enabled vLLM server with the following command:
LMCACHE_CONFIG_FILE=./lmcache_config.yaml \
LMCACHE_USE_EXPERIMENTAL=True vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-model-len 4096 --gpu-memory-utilization 0.8 --port 8000 \
--kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}'
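Here, LMCACHE_CONFIG_FILE points to a YAML file that configures LMCache itself (chunk size, CPU offloading, optional remote backends). Below is a minimal sketch of lmcache_config.yaml for CPU offloading, using key names from LMCache's example configurations; treat the values as illustrative and consult the LMCache configuration reference for the authoritative list:
chunk_size: 256           # number of tokens per KV cache chunk
local_cpu: True           # keep KV cache chunks in CPU RAM
max_local_cpu_size: 5.0   # cap the CPU cache at roughly 5 GB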
Note
LMCache v1 can be run directly with the vllm serve command. For more details, please refer to the examples in the LMCache v1 section.
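Once the server above is running, it exposes the standard OpenAI-compatible API on port 8000, so any OpenAI client can query it. A minimal sketch using the openai Python package (the base URL and placeholder API key are assumptions for a local deployment):
from openai import OpenAI

# Point the client at the LMCache-enabled vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain what a KV cache is in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)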
LMCache v0
LMCache v0 has the same interface as vLLM (both online serving and offline inference). To use online serving, you can start an OpenAI API-compatible vLLM server with LMCache via:
$ lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
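The resulting server exposes the usual OpenAI-compatible endpoints, so it can be queried with the same kind of client request shown in the v1 section above, with the model name set to lmsys/longchat-7b-16k.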
To use vLLM’s offline inference with LMCache, simply prefix the imports of the vLLM components with lmcache_vllm. For example:
from lmcache_vllm.vllm import LLM, SamplingParams

# Load the model through LMCache's vLLM wrapper
model = LLM(model="lmsys/longchat-7b-16k", gpu_memory_utilization=0.8)

# Use the model
outputs = model.generate("Hello, my name is", SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
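Apart from the lmcache_vllm prefix on the imports, this is ordinary vLLM offline-inference code, so an existing script can pick up LMCache by changing only its import lines.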