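CPU RAM#
Overview#
CPU RAM and local storage are two ways to offload the KV cache to non-GPU memory on the same machine that runs inference.
Two ways to configure LMCache CPU offloading:#
1. Environment variables: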
# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Enable CPU memory backend
export LMCACHE_LOCAL_CPU=True # default
# 5GB of Pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 # default
2. Configuration file:
Passed in through LMCACHE_CONFIG_FILE=your-lmcache-config.yaml
Example config.yaml:
# 256 Tokens per KV Chunk
chunk_size: 256
# Enable CPU memory backend
local_cpu: true # default
# 5GB of Pinned CPU memory
max_local_cpu_size: 5.0 # default
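The two forms above appear to differ only in naming: each config key shown on this page has an environment variable spelled as `LMCACHE_` plus the upper-cased key. The sketch below illustrates that correspondence. The helper `to_env_exports` is hypothetical, and the naming rule is inferred from the examples here rather than taken from LMCache's source, so check it against the LMCache documentation before relying on it for other keys.

```python
# Illustrative only: derive the environment-variable form of the config keys
# shown on this page. The "LMCACHE_" + upper-case rule is inferred from those
# examples, not taken from LMCache's source.


def to_env_exports(config: dict) -> list[str]:
    """Return `export NAME=value` lines for a flat config mapping."""
    lines = []
    for key, value in config.items():
        name = "LMCACHE_" + key.upper()
        # Python booleans print as True/False, matching the examples above.
        lines.append(f"export {name}={value}")
    return lines


config = {"chunk_size": 256, "local_cpu": True, "max_local_cpu_size": 5.0}
for line in to_env_exports(config):
    print(line)
```

Running it prints the same three `export` lines listed under option 1.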
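CPU RAM notes:#
LMCACHE_MAX_LOCAL_CPU_SIZE is the amount of page-locked CPU memory (page-locked for fast GPU transfers) that LMCache reserves. It must be greater than 0, because the local and remote backends also use CPU memory as an intermediate buffer when moving KV caches to and from the GPU. This means you can set LMCACHE_LOCAL_CPU=False even when LMCACHE_MAX_LOCAL_CPU_SIZE is non-zero.
However, we recommend always keeping LMCACHE_LOCAL_CPU=True (True is the default, so CPU offloading is enabled automatically if you don't specify it), because this lets all of the pinned CPU memory that LMCache has reserved but is not currently using hold KV caches. When pinned CPU memory is needed for a disk or remote transfer, CPU KV caches are evicted in LRU order to make room, so there is no danger of running out of pinned CPU memory.
When LMCACHE_LOCAL_CPU=True is used together with the disk backend or a remote backend (Redis, Mooncake, Valkey, or Infinistore), CPU memory can be thought of as a "hot cache" that holds the hottest (most recently accessed) subset of the KV caches in disk and remote storage.
For this reason, the cache engine also has a prefetch mechanism that preloads the KV caches for specified tokens from disk or remote storage into pinned CPU RAM (provided the KV caches for those tokens are already stored there). If we predict that those tokens will be requested soon (for example, in structured or agentic workflows), this avoids the latency of disk and remote KV transfers ahead of time.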
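To make the "hot cache" and LRU eviction ideas concrete, here is a toy model of a byte-bounded LRU cache. It is a conceptual sketch only: the class `HotCache` and its methods are hypothetical and say nothing about how LMCache actually lays out or evicts pinned memory.

```python
from collections import OrderedDict


class HotCache:
    """Toy LRU cache keyed by chunk id, bounded by a byte budget."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._chunks = OrderedDict()  # chunk_id -> size in bytes

    def get(self, chunk_id: str) -> bool:
        """Return True on a hit and mark the chunk as most recently used."""
        if chunk_id not in self._chunks:
            return False
        self._chunks.move_to_end(chunk_id)
        return True

    def put(self, chunk_id: str, size: int) -> list[str]:
        """Insert a chunk, evicting least recently used chunks if needed.

        Returns the ids that were evicted to make room.
        """
        if size > self.capacity_bytes:
            raise ValueError("chunk larger than the whole cache")
        if chunk_id in self._chunks:
            self.used_bytes -= self._chunks.pop(chunk_id)
        evicted = []
        while self.used_bytes + size > self.capacity_bytes:
            old_id, old_size = self._chunks.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self._chunks[chunk_id] = size
        self.used_bytes += size
        return evicted


cache = HotCache(capacity_bytes=3)
cache.put("a", 1)
cache.put("b", 1)
cache.put("c", 1)
cache.get("a")            # "a" becomes the most recently used chunk
print(cache.put("d", 1))  # evicts "b", the least recently used
```

Because eviction always frees the least recently used entries first, an insert that fits within the budget can never fail for lack of space, which is the property the paragraph above relies on.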
Hugepage Support#
By default LMCache allocates CPU-pinned memory using regular 4 KiB pages. For large KV cache buffers (multiple gigabytes), enabling Linux hugepages (2 MiB pages) can reduce TLB (Translation Lookaside Buffer) pressure and improve memory access performance.
System prerequisite
Hugepages must be pre-allocated at the OS level before LMCache starts. To find the number of pages needed, divide the desired buffer size by 2 MiB and round up. For example, 5 GB requires at least 2560 pages:
# Allocate 2560 hugepages (5 GB)
sudo sysctl -w vm.nr_hugepages=2560
# Make persistent across reboots
echo 'vm.nr_hugepages=2560' | sudo tee -a /etc/sysctl.conf
Verify that pages are available:
grep HugePages /proc/meminfo
# HugePages_Total: 2560
# HugePages_Free: 2560
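The page count and the availability check can also be computed in a few lines. This is a minimal sketch with hypothetical helper names; it treats the buffer size as GiB, which is the reading under which 5 GB comes out to exactly 2560 pages of 2 MiB.

```python
HUGEPAGE_BYTES = 2 * 1024**2  # 2 MiB, the hugepage size assumed above


def hugepages_needed(buffer_gib: float) -> int:
    """Number of 2 MiB hugepages that cover `buffer_gib` GiB, rounded up."""
    buffer_bytes = int(buffer_gib * 1024**3)
    return -(-buffer_bytes // HUGEPAGE_BYTES)  # ceiling division


def hugepages_free(meminfo_text: str) -> int:
    """Extract HugePages_Free from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("HugePages_Free:"):
            return int(line.split(":", 1)[1])
    raise ValueError("HugePages_Free not found")


print(hugepages_needed(5.0))
sample = "HugePages_Total:    2560\nHugePages_Free:     2560\n"
print(hugepages_free(sample) >= hugepages_needed(5.0))
```

On a real machine, pass the text of `/proc/meminfo` instead of `sample`.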
Configuration
local_cpu_use_hugepages: true
Or via environment variable:
export LMCACHE_LOCAL_CPU_USE_HUGEPAGES=true
Restrictions
Hugepages are not compatible with P2P mode (enable_p2p: true).
Hugepages are not compatible with shared memory (shm_name is set).
On non-CUDA platforms, hugepages are not supported. Regular allocation will be used as fallback.
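A quick pre-flight check can catch the first two restrictions before a server is launched. The function below is a hypothetical helper that works on a plain dict of config values; only the key names come from this page, and LMCache may report these conflicts differently (or itself) at startup.

```python
def check_hugepage_config(config: dict) -> list[str]:
    """Return human-readable conflicts for the hugepage restrictions above."""
    problems = []
    if not config.get("local_cpu_use_hugepages", False):
        return problems  # hugepages disabled, nothing to check
    if config.get("enable_p2p", False):
        problems.append("hugepages are not compatible with enable_p2p: true")
    if config.get("shm_name"):
        problems.append("hugepages are not compatible with shm_name")
    return problems


config = {"local_cpu_use_hugepages": True, "enable_p2p": True}
for problem in check_hugepage_config(config):
    print(problem)
```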
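Online inference example#
Let's feel the TTFT (time to first token) difference!
Prerequisites:
A machine with at least one GPU. Adjust the max model length of your vllm instance according to your VRAM and the long context you want to use.
vllm and LMCache installed (installation guide)
Hugging Face access to meta-llama/Meta-Llama-3.1-8B-Instruct: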
export HF_TOKEN=your_hugging_face_token
A few packages:
pip install openai transformers
Step 0. Set up a directory for this example:
mkdir lmcache-cpu-ram-example
cd lmcache-cpu-ram-example
Step 1. Prepare a long context!
We want a context long enough that vLLM's prefix caching cannot keep its KV cache in VRAM, so that LMCache is needed to hold the KV cache outside VRAM:
# 382757 bytes
man bash > man-bash.txt
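To get a feel for how much memory a long context's KV cache occupies, the standard per-token estimate is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below applies it with values that are assumptions, not facts stated on this page: the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and a CPU buffer size read as GiB.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = keys + values


def tokens_that_fit(buffer_gib: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in a buffer of `buffer_gib` GiB."""
    return int(buffer_gib * 1024**3) // bytes_per_token


# Assumed model shape: Llama 3.1 8B with a 16-bit KV cache.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)                        # bytes per token
print(16384 * per_token / 1024**3)      # GiB for a 16384-token context
print(tokens_that_fit(5.0, per_token))  # tokens that fit in a 5 GiB buffer
```

Under these assumptions a token costs 128 KiB, so a 16384-token context needs about 2 GiB of KV cache. Actual usage can differ with quantized KV caches or other model variants.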
Step 2. Start a vLLM server with CPU offloading enabled:
Create an lmcache configuration file named cpu-offload.yaml
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5.0
If you don't want to use a config file, uncomment the first three environment variables and comment out LMCACHE_CONFIG_FILE below:
# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_LOCAL_CPU=True \
# LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 \
LMCACHE_CONFIG_FILE="cpu-offload.yaml" \
vllm serve \
meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384 \
--kv-transfer-config \
'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
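--kv-transfer-config: the parameter that actually tells vLLM to use LMCache for KV cache offloading.
kv_connector: specifies the LMCache connector for vLLM V1.
kv_role: set to "kv_both" to both store and load KV caches (important because we will run two queries: the first generates and stores the KV cache, and the second consumes and loads it).
Step 3. Query with LMCache to see the TTFT improvement:
Once the OpenAI-compatible server is running on the default vllm port 8000, let's query it twice with the same long context!
Create a script named query-twice.py and paste in the following code: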
import time

from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(
    api_key="dummy-key",  # required by OpenAI client even for local servers
    base_url="http://localhost:8000/v1",
)

# Use whichever model the server is serving
models = client.models.list()
model = models.data[0].id

# 119512 characters total
# 26054 tokens total
with open("man-bash.txt", "r") as f:
    long_context = f.read()

# a truncation of the long context for the --max-model-len 16384
# if you increase the --max-model-len, you can decrease the truncation i.e.
# use more of the long context
long_context = long_context[:70000]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
question = "Summarize bash in 2 sentences."
prompt = f"{long_context}\n\n{question}"
print(f"Number of tokens in prompt: {len(tokenizer.encode(prompt))}")


def query_and_measure_ttft():
    """Stream one chat completion and return the time to first token in seconds."""
    start = time.perf_counter()
    ttft = None
    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        temperature=0.7,
        stream=True,
    )
    for chunk in chat_completion:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        chunk_message = chunk.choices[0].delta.content
        if chunk_message is not None:
            if ttft is None:
                ttft = time.perf_counter()  # first content token arrived
            print(chunk_message, end="", flush=True)
    print("\n")  # New line after streaming
    if ttft is None:
        raise RuntimeError("the server returned no content")
    return ttft - start


print("Querying vLLM server with cold LMCache CPU Offload")
cold_ttft = query_and_measure_ttft()
print(f"Cold TTFT: {cold_ttft:.3f} seconds")

print("\nQuerying vLLM server with warm LMCache CPU Offload")
warm_ttft = query_and_measure_ttft()
print(f"Warm TTFT: {warm_ttft:.3f} seconds")

print(
    f"\nTTFT Improvement: {(cold_ttft - warm_ttft):.3f} seconds "
    f"({(cold_ttft / warm_ttft):.1f}x faster)"
)
Then run:
python query-twice.py
Since we are in streaming mode, you will be able to feel the TTFT difference in real time!
Example output:
Number of tokens in prompt: 15376
Querying vLLM server with cold LMCache
Bash is a Unix shell and command-line interpreter that executes commands read
from the standard input or from a file, incorporating features from the Korn
and C shells. It is an sh-compatible command language interpreter that can be
configured to be POSIX-conformant by default and is intended to be a conformant
implementation of the Shell and Utilities portion of the IEEE POSIX specification.
Cold TTFT: 6.537 seconds
Querying vLLM server with warm LMCache
Bash is a Unix shell and command-line interpreter that eead from the standard
input or from a file, incorporatinhe Korn and C shells. It is intended to be a
conformant tation of the IEEE POSIX specification and can be configured to be
POSIX-conformant by default, with options for setting the shell's behavior and
interacting with the user.
Warm TTFT: 0.147 seconds
TTFT Improvement: 6.390 seconds (44.5x faster)
If you look at the vLLM server's logs, you should see (logs truncated for cleanliness):
# Cold LMCache Miss and then Store
LMCache INFO: Reqid: chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd, Total tokens 15410,
LMCache hit tokens: 0, need to load: 0
# you should see 8 of these storing logs total
# 2048 tokens is a multiple of the chunk size
LMCache INFO: Storing KV cache for 2048 out of 12288 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd
LMCache INFO: Storing KV cache for 2048 out of 14336 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd
LMCache INFO: Storing KV cache for 1074 out of 15410 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd
# Warm LMCache Hit!!
LMCache INFO: Reqid: chatcmpl-136d9dac1ba94bd4b4ae85007e8ad437, Total tokens 15410,
LMCache hit tokens: 15409, need to load: 1
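The numbers in the storing logs can be checked with simple arithmetic. The sketch below splits the 15410-token request into steps of 2048 tokens; that step size is read off the log lines above and is treated here as an assumption about this particular run, not as a fixed LMCache or vLLM constant.

```python
def store_steps(total_tokens: int, step_tokens: int) -> list[int]:
    """Split a prompt into per-step token counts; the last step holds the rest."""
    full, rest = divmod(total_tokens, step_tokens)
    return [step_tokens] * full + ([rest] if rest else [])


steps = store_steps(total_tokens=15410, step_tokens=2048)
print(len(steps))   # number of "Storing KV cache" log lines
print(steps[-1])    # size of the final, partial step
print(2048 // 256)  # KV chunks of chunk_size 256 in one full step
```

This gives 8 steps with a final step of 1074 tokens, matching the last storing log line, and each full step covers 8 chunks of 256 tokens.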
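Tips:#
If you want to run the query-twice.py script more than once, you need to restart the vLLM LMCache server or change the context prefix you pass in, because LMCache has already been warmed.
The max model length here was chosen by running on an L4 with only 23GB of VRAM. If you have more memory, you can increase the max model length and modify query-twice.py to use more of the long context. As the context length grows, the TTFT improvement from LMCache becomes more pronounced!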