# FAQ

## What are the KV cache sizes for popular models? And why is LMCache important?
You can calculate KV cache sizes using our KV cache calculator. We also provide a reference table below with KV cache information for some popular models.
As the table shows, after loading Qwen/Qwen3-32B (tp=2 on H100), for example, the spare GPU RAM has room for KV caches covering only 275,760 tokens. If each prompt is 40,960 tokens long, that supports only about 6.73 concurrent users (275,760 / 40,960 ≈ 6.73). Once this capacity is exceeded, KV caches must be evicted, and when the same user returns, their request has to be re-prefilled, which takes significantly longer.
LMCache is designed to extend this capacity beyond GPU RAM (for example, into CPU RAM, local disk, or remote storage), so you can keep more KV caches around and avoid costly re-prefill operations.
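As an illustration, here is a minimal sketch of enabling LMCache's CPU offloading through vLLM's offline API. The connector name, config class, and environment variables follow the LMCache vLLM-v1 integration examples, but exact names can vary across versions, so treat this as an assumption and check the LMCache documentation for your setup.

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache settings (assumed names from the LMCache examples): offload KV
# caches to up to 5 GB of CPU RAM, in 256-token chunks.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

# Route vLLM's KV transfers through the LMCache connector.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")
llm = LLM(model="Qwen/Qwen3-8B", kv_transfer_config=ktc,
          gpu_memory_utilization=0.8)

# Requests that share a previously seen prefix can now reuse KV caches
# held in CPU RAM instead of being re-prefilled on the GPU.
outputs = llm.generate(["Explain KV caching in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```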
### KV Cache Sizes for Popular Models

| Model | KV cache size per 1,000 tokens | Spare GPU RAM for KV cache | Context length | Full-length prompts that fit in GPU RAM |
|---|---|---|---|---|
| Qwen/Qwen3-8B | 0.1373 GB | 50.32 GB (or 366,400 tokens) | 40,960 tokens | 8.95x |
| Qwen/Qwen3-32B (tp=2 on H100) | 0.2441 GB | 33.66 GB × 2 (or 275,760 tokens) | 40,960 tokens | 6.73x |
| meta-llama/Llama-3.1-70B (tp=4 on H100) | 0.3052 GB | 32.06 GB × 4 (or 420,208 tokens) | 131,072 tokens | 3.21x |
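For reference, the per-1,000-token column follows from the standard KV-cache size formula: 2 (key and value) × num_layers × num_kv_heads × head_dim × bytes_per_element. Here is a minimal sketch that reproduces the table's figures, assuming bf16 KV caches and the GQA shapes from each model's published config (double-check config.json for your exact checkpoint):

```python
# Per-model architecture parameters: (num_layers, num_kv_heads, head_dim).
# Taken from the models' published configs; verify against config.json.
MODELS = {
    "Qwen/Qwen3-8B": (36, 8, 128),
    "Qwen/Qwen3-32B": (64, 8, 128),
    "meta-llama/Llama-3.1-70B": (80, 8, 128),
}

BYTES_PER_ELEM = 2  # bf16

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int) -> int:
    # The leading 2 accounts for the key tensor and the value tensor.
    return 2 * layers * kv_heads * head_dim * BYTES_PER_ELEM

for name, (layers, kv_heads, head_dim) in MODELS.items():
    per_1k_gib = kv_bytes_per_token(layers, kv_heads, head_dim) * 1000 / 2**30
    print(f"{name}: {per_1k_gib:.4f} GiB per 1,000 tokens")

# Output matches the table:
# Qwen/Qwen3-8B: 0.1373 GiB per 1,000 tokens
# Qwen/Qwen3-32B: 0.2441 GiB per 1,000 tokens
# meta-llama/Llama-3.1-70B: 0.3052 GiB per 1,000 tokens
```

Dividing the spare GPU RAM by the per-token size gives the token capacity, and dividing that by the context length gives the last column of the table.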
> **Note:** You may also find this VRAM Calculator useful for estimating the spare GPU RAM available for different models and configurations.
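If you just want a rough number without the calculator, the spare RAM is approximately total VRAM × vLLM's gpu_memory_utilization, minus the per-GPU weight shard, minus a few GB for activations and CUDA graphs. The sketch below is a back-of-envelope estimate under those assumptions; the overhead figure in particular is a guess you should tune for your setup.

```python
def spare_kv_gib(vram_gib: float, params_b: float, tp: int,
                 gpu_mem_util: float = 0.9, overhead_gib: float = 4.0) -> float:
    """Rough per-GPU spare RAM for KV cache, assuming bf16 weights
    (2 bytes/param) sharded evenly across tp GPUs. overhead_gib is an
    assumed allowance for activations and framework buffers.
    """
    weights_gib = params_b * 2 / tp * 1e9 / 2**30  # bf16 shard, in GiB
    return vram_gib * gpu_mem_util - weights_gib - overhead_gib

# H100 (80 GB, i.e. ~74.5 GiB), Qwen3-32B (~32.8B params) with tp=2:
print(f"{spare_kv_gib(74.5, 32.8, 2):.1f} GiB per GPU")
# ~32.5 GiB, in the ballpark of the 33.66 GiB shown in the table.
```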