Welcome to LMCache!

Supercharge Your LLM with the Fastest KV Cache Layer.

Note

We are currently in the process of upgrading our documentation to provide better guidance and examples. Some sections may be under construction. Thank you for your patience!

LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any repeated text (not necessarily a prefix) in any serving engine instance. This reduces prefill delay, i.e., time to first token (TTFT), and saves GPU cycles and memory. Combined with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
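The idea of prefilling each text only once can be illustrated with a toy sketch. This is not LMCache's API: `ToyKVStore`, `get_kv`, and `serve` are hypothetical names, the "KV cache" is a placeholder list, and the sketch ignores the cross-chunk attention effects that a real system has to handle when reusing a non-prefix chunk.

```python
import hashlib


class ToyKVStore:
    """Toy stand-in for a KV cache layer (hypothetical, not LMCache's API)."""

    def __init__(self):
        self._store = {}          # chunk hash -> placeholder "KV cache"
        self.prefill_calls = 0    # counts how often the expensive step runs

    def _key(self, text):
        # Key each chunk by its content, so position in the prompt does not matter.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _prefill(self, text):
        # Placeholder for the expensive GPU prefill of one text chunk.
        self.prefill_calls += 1
        return [len(token) for token in text.split()]

    def get_kv(self, text):
        # Reuse the stored entry if this chunk was seen before; otherwise prefill it.
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._prefill(text)
        return self._store[key]


def serve(store, chunks):
    """Collect the (placeholder) KV cache for every chunk of a request."""
    return [store.get_kv(chunk) for chunk in chunks]


store = ToyKVStore()
serve(store, ["system prompt", "doc A", "question 1"])
# "doc A" is reused here even though it is not part of the shared prefix.
serve(store, ["system prompt", "doc B", "doc A", "question 2"])
print(store.prefill_calls)  # 5 prefills instead of 7 without reuse
```

Keying by chunk content rather than by prompt position is what lets the second request reuse "doc A" after a different preceding document.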

For more information, check out the following:

LMCache blogs
Join the LMCache Slack workspace
Our papers:

Documentation

Getting Started

Installation
Quickstart Examples
Troubleshooting
FAQ

Disaggregated prefill

Using NIXL
- Examples
Using shared storage

KV Cache management

LMCache Controller
Lookup the KV cache
- Example usage:
Persist the KV cache
Clear the KV cache
- Example usage:
Move the KV cache
Compress the KV cache
Check finish of a control event

KV Cache Optimizations

Compression
- CacheGen
Blending

Use LMCache in production

Docker deployment
- Running the container image
Kubernetes deployment

Developer Guide

Contributing Guide
Dockerfile
- Building the container image
Usage Data Module
- Usage Stats Collection

API Reference

Configuring LMCache
Adding new storage backends
vLLM Dynamic Connector
- Upstream Integration:
- Dynamic Connector:

Community

Community meetings
- Meeting schedule
Blogs
