CPU RAM

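Warning

This page documents the behavior of LMCache's in-process mode (deprecated). Consider using LMCache MP mode for better feature support and performance. For the MP-mode equivalent of this page, see Second-Level KV Storage.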

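Overview

CPU RAM and local storage are two ways to offload the KV cache to non-GPU memory on the same machine that runs inference.

Two ways to configure LMCache CPU offloading:

1. Environment variables: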

# 256 Tokens per KV Chunk
export LMCACHE_CHUNK_SIZE=256
# Enable CPU memory backend
export LMCACHE_LOCAL_CPU=True # default
# 5GB of Pinned CPU memory
export LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 # default

2. Configuration file:

Pass it in via LMCACHE_CONFIG_FILE=your-lmcache-config.yaml

Example config.yaml:

# 256 Tokens per KV Chunk
chunk_size: 256
# Enable CPU memory backend
local_cpu: true # default
# 5GB of Pinned CPU memory
max_local_cpu_size: 5.0 # default
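The environment variables and the YAML keys shown above appear to follow one naming rule: the variable is LMCACHE_ plus the upper-cased key. The sketch below renders a config dictionary as export lines under that rule. The helper name to_env_exports is hypothetical, and the rule is only inferred from the pairs on this page, so other keys may not follow it.

```python
def to_env_exports(config: dict) -> list[str]:
    """Render config keys as `export LMCACHE_<KEY>=<value>` lines.

    Assumption: the LMCACHE_ + upper-cased key pattern, which matches the
    pairs shown on this page; other keys may not follow it.
    """
    lines = []
    for key, value in config.items():
        # str(True) is "True", the same spelling used in the shell example.
        lines.append(f"export LMCACHE_{key.upper()}={value}")
    return lines


config = {"chunk_size": 256, "local_cpu": True, "max_local_cpu_size": 5.0}
for line in to_env_exports(config):
    print(line)
```

Running it prints the same three export lines as the environment-variable example, without the comments.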

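Notes on CPU RAM:

LMCACHE_MAX_LOCAL_CPU_SIZE is the amount of page-locked CPU memory (pinned for fast GPU transfers) that LMCache reserves. It must be greater than 0, because the local and remote backends also use CPU memory as an intermediate buffer when moving KV caches to and from the GPU. As a result, LMCACHE_LOCAL_CPU=False is a valid setting even when LMCACHE_MAX_LOCAL_CPU_SIZE is non-zero.

However, it is recommended to always keep LMCACHE_LOCAL_CPU=True (True is the default, so CPU offloading is enabled automatically if you do not specify it), because this lets all of the pinned CPU memory that LMCache has reserved and is not currently using hold KV caches. When pinned CPU memory is needed for a disk or remote transfer, CPU KV caches are evicted in LRU order to make room, so there is no danger of running out of pinned CPU memory.

When LMCACHE_LOCAL_CPU=True is combined with the disk backend or a remote backend (Redis, Mooncake, Valkey, or Infinistore), CPU memory can be viewed as a "hot cache" that holds the "hottest" (most recently accessed) subset of the KV caches on disk and in remote storage.

For this reason, the cache engine also has a prefetch mechanism that preloads the KV caches for specified tokens from disk or remote storage into pinned CPU RAM (provided the KV caches for those tokens are already stored there). If we predict that those tokens will be requested soon (for example, in structured or agentic workflows), prefetching avoids the latency of disk and remote KV transfers ahead of time.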
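The hot-cache and prefetch behavior described above can be illustrated with a toy model. This is a minimal sketch, not LMCache's implementation: the HotCache class, its chunk-count capacity, and the dictionary standing in for disk or remote storage are all hypothetical.

```python
from collections import OrderedDict


class HotCache:
    """Toy LRU cache standing in for pinned CPU RAM (not LMCache's code)."""

    def __init__(self, capacity_chunks: int):
        self.capacity = capacity_chunks
        self.chunks = OrderedDict()  # key -> KV bytes, least recent first

    def put(self, key, value):
        self.chunks[key] = value
        self.chunks.move_to_end(key)  # mark as most recently used
        while len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict the least recently used

    def get(self, key):
        if key not in self.chunks:
            return None
        self.chunks.move_to_end(key)
        return self.chunks[key]

    def prefetch(self, keys, slow_store: dict):
        """Copy chunks that exist in the slow store into the hot cache."""
        for key in keys:
            if key in slow_store and key not in self.chunks:
                self.put(key, slow_store[key])


slow_store = {"a": b"kv-a", "b": b"kv-b", "c": b"kv-c"}  # disk/remote stand-in
cache = HotCache(capacity_chunks=2)
cache.prefetch(["a", "b", "missing"], slow_store)  # "missing" is skipped
cache.get("a")                     # "a" becomes the hottest chunk
cache.prefetch(["c"], slow_store)  # evicts "b", the least recently used
print(list(cache.chunks))
```

After the last prefetch the cache holds "a" and "c": the chunk that was read most recently survives, and the one that was not touched is evicted.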

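Hugepages Support

By default, LMCache allocates pinned CPU memory using regular 4 KiB pages. For large KV cache buffers (multiple gigabytes), enabling Linux hugepages (2 MiB pages) can reduce TLB (Translation Lookaside Buffer) pressure and improve memory access performance.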

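System prerequisites

Hugepages must be preallocated at the OS level before LMCache starts. To find the number of pages needed, divide the desired buffer size by 2 MiB and round up. For example, a 5 GB buffer, counted as 5120 MiB, needs at least 2560 pages: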

# Allocate 2560 hugepages (5 GB)
sudo sysctl -w vm.nr_hugepages=2560

# Make persistent across reboots
echo 'vm.nr_hugepages=2560' | sudo tee -a /etc/sysctl.conf

Verify that the pages are available:

grep HugePages /proc/meminfo
# HugePages_Total:    2560
# HugePages_Free:     2560
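The page-count arithmetic and the /proc/meminfo check above can be scripted. The following is a minimal sketch; the function names hugepages_needed and free_hugepages are hypothetical, and the buffer size is taken in GiB so that 5 corresponds to the 2560 pages used above.

```python
import math
import re


def hugepages_needed(buffer_gib: float, page_mib: int = 2) -> int:
    """Buffer size divided by the hugepage size, rounded up."""
    return math.ceil(buffer_gib * 1024 / page_mib)


def free_hugepages(meminfo_text: str) -> int:
    """Read HugePages_Free from the text of /proc/meminfo."""
    match = re.search(r"^HugePages_Free:\s+(\d+)", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("HugePages_Free not found")
    return int(match.group(1))


# Sample text copied from the verification output above.
sample = "HugePages_Total:    2560\nHugePages_Free:     2560\n"
needed = hugepages_needed(5.0)
print(needed, free_hugepages(sample) >= needed)
```

On a Linux machine, the sample string can be replaced with the contents of /proc/meminfo, read with open("/proc/meminfo").read().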

Configuration

local_cpu_use_hugepages: true

Or via an environment variable:

export LMCACHE_LOCAL_CPU_USE_HUGEPAGES=true

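Limitations

Hugepages are incompatible with P2P mode (enable_p2p: true).
Hugepages are incompatible with shared memory (shm_name is set).
Hugepages are not supported on non-CUDA platforms; regular allocation is used as a fallback.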
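The first two limitations above can be caught before starting the server with a small pre-flight check on the config. This is a hedged sketch: hugepage_conflicts is a hypothetical helper that operates on a plain dictionary using the key names from this page, and it does not call into LMCache.

```python
def hugepage_conflicts(config: dict) -> list[str]:
    """Return reasons why hugepages clash with the rest of the config."""
    if not config.get("local_cpu_use_hugepages", False):
        return []  # hugepages are off, so nothing can conflict
    problems = []
    if config.get("enable_p2p", False):
        problems.append("hugepages are incompatible with enable_p2p: true")
    if config.get("shm_name"):
        problems.append("hugepages are incompatible with a set shm_name")
    return problems


config = {
    "local_cpu": True,
    "max_local_cpu_size": 5.0,
    "local_cpu_use_hugepages": True,
    "enable_p2p": True,
}
for problem in hugepage_conflicts(config):
    print(problem)
```

The platform limitation is not checked here, since the text above says LMCache falls back to regular allocation on its own in that case.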

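Online Inference Example

Let's feel the TTFT (time to first token) difference!

Prerequisites:

A machine with at least one GPU. Adjust the max model length of your vllm instance to fit your GPU memory and the long context you want to use.
vllm and lmcache installed (installation guide)
Hugging Face access to meta-llama/Meta-Llama-3.1-8B-Instruct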

export HF_TOKEN=your_hugging_face_token

A few packages:

pip install openai transformers

Step 0. Set up a directory for this example:

mkdir lmcache-cpu-ram-example
cd lmcache-cpu-ram-example

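Step 1. Prepare a long context!

We want a context long enough that vLLM's prefix cache cannot keep the KV cache in GPU memory, so that LMCache is needed to keep the KV cache in non-GPU memory: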

# 382757 bytes
man bash > man-bash.txt

Step 2. Start a vLLM server with CPU offloading enabled:

Create an lmcache configuration file named cpu-offload.yaml

chunk_size: 256
local_cpu: true
max_local_cpu_size: 5.0

If you don't want to use a config file, uncomment the first three environment variables and comment out the LMCACHE_CONFIG_FILE line below:

# LMCACHE_CHUNK_SIZE=256 \
# LMCACHE_LOCAL_CPU=True \
# LMCACHE_MAX_LOCAL_CPU_SIZE=5.0 \
LMCACHE_CONFIG_FILE="cpu-offload.yaml" \
vllm serve \
    meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

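--kv-transfer-config: the parameter that actually tells vLLM to use LMCache for KV cache offloading.
- kv_connector: specifies the LMCache connector for vLLM V1
- kv_role: set to "kv_both" to both store and load the KV cache (important because we will run two queries: the first produces and stores a KV cache, and the second consumes and loads it)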
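When the server is launched from a script, hand-written JSON inside shell quotes is easy to get wrong. As a small sketch, the same --kv-transfer-config value can be built from a dictionary and quoted for the shell with the standard library; only the two keys shown above are used.

```python
import json
import shlex

# The same two settings passed to vllm serve above.
kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both",
}

arg = json.dumps(kv_transfer_config)
# shlex.quote wraps the JSON so the shell passes it as one argument.
print("--kv-transfer-config", shlex.quote(arg))
```

The printed fragment can be pasted into the vllm serve command line in place of the hand-written JSON.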

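Step 3. Query twice to see the TTFT improvement with LMCache:

Once the OpenAI-compatible server is running on the default vllm port 8000, let's query it twice with the same long context!

Create a script named query-twice.py and paste in the following code: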

import time
from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(
    api_key="dummy-key",  # required by OpenAI client even for local servers
    base_url="http://localhost:8000/v1"
)

models = client.models.list()
model = models.data[0].id

# 119512 characters total
# 26054 tokens total
long_context = ""
with open("man-bash.txt", "r") as f:
    long_context = f.read()

# a truncation of the long context for the --max-model-len 16384
# if you increase the --max-model-len, you can decrease the truncation i.e.
# use more of the long context
long_context = long_context[:70000]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
question = "Summarize bash in 2 sentences."

prompt = f"{long_context}\n\n{question}"

print(f"Number of tokens in prompt: {len(tokenizer.encode(prompt))}")

def query_and_measure_ttft():
    start = time.perf_counter()
    ttft = None

    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        temperature=0.7,
        stream=True,
    )

    for chunk in chat_completion:
        chunk_message = chunk.choices[0].delta.content
        if chunk_message is not None:
            if ttft is None:
                ttft = time.perf_counter()
            print(chunk_message, end="", flush=True)

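    print("\n")  # New line after streaming
    if ttft is None:
        # No content chunk arrived, so there is no first token to time
        raise RuntimeError("The server streamed no content tokens")
    return ttft - start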

print("Querying vLLM server with cold LMCache CPU Offload")
cold_ttft = query_and_measure_ttft()
print(f"Cold TTFT: {cold_ttft:.3f} seconds")

print("\nQuerying vLLM server with warm LMCache CPU Offload")
warm_ttft = query_and_measure_ttft()
print(f"Warm TTFT: {warm_ttft:.3f} seconds")

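# Adjacent f-strings avoid the stray indentation that a backslash
# continuation inside the string literal would print
print(
    f"\nTTFT Improvement: {(cold_ttft - warm_ttft):.3f} seconds "
    f"({(cold_ttft / warm_ttft):.1f}x faster)"
)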

Then run:

python query-twice.py

Since we are in streaming mode, you will be able to feel the TTFT difference in real time!

Example output:

Number of tokens in prompt: 15376
Querying vLLM server with cold LMCache
Bash is a Unix shell and command-line interpreter that executes commands read
from the standard input or from a file, incorporating features from the Korn
and C shells. It is an sh-compatible command language interpreter that can be
configured to be POSIX-conformant by default and is intended to be a conformant
implementation of the Shell and Utilities portion of the IEEE POSIX specification.

Cold TTFT: 6.537 seconds

Querying vLLM server with warm LMCache
Bash is a Unix shell and command-line interpreter that executes commands read
from the standard input or from a file, incorporating features from the Korn and
C shells. It is intended to be a conformant implementation of the IEEE POSIX
specification and can be configured to be
POSIX-conformant by default, with options for setting the shell's behavior and
interacting with the user.

Warm TTFT: 0.147 seconds

TTFT Improvement: 6.390 seconds (44.5x faster)

If you check the vLLM server logs, you should see the following (logs truncated for readability):

# Cold LMCache Miss and then Store

LMCache INFO: Reqid: chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd, Total tokens 15410,
LMCache hit tokens: 0, need to load: 0

# you should see 8 of these storing logs total
# 2048 tokens is a multiple of the chunk size
LMCache INFO: Storing KV cache for 2048 out of 12288 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd

LMCache INFO: Storing KV cache for 2048 out of 14336 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd

LMCache INFO: Storing KV cache for 1074 out of 15410 tokens for request
chatcmpl-8676f9b9ebf04c79a5d47b9ada7b65fd

# Warm LMCache Hit!!

LMCache INFO: Reqid: chatcmpl-136d9dac1ba94bd4b4ae85007e8ad437, Total tokens 15410,
LMCache hit tokens: 15409, need to load: 1
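The numbers in these logs can be checked with a few lines of arithmetic. The sketch below assumes the 2048-token store batch size that is visible in the log lines; that value is read from the logs, not from a documented LMCache setting, and store_batches is a hypothetical helper.

```python
def store_batches(total_tokens: int, batch_tokens: int) -> list[int]:
    """Split a prompt into full store batches plus one final partial batch."""
    full, rest = divmod(total_tokens, batch_tokens)
    return [batch_tokens] * full + ([rest] if rest else [])


total_tokens = 15410   # "Total tokens" in the logs above
batch_tokens = 2048    # assumption: batch size as seen in the storing logs
chunk_size = 256       # chunk_size from cpu-offload.yaml

batches = store_batches(total_tokens, batch_tokens)
print(len(batches))                    # number of storing log lines
print(batches[-1])                     # size of the final, partial batch
print(batch_tokens % chunk_size == 0)  # batch is a whole number of chunks
print(f"{15409 / total_tokens:.4%}")   # share of tokens served from cache
```

This gives 8 storing logs with a final batch of 1074 tokens, which matches the last storing line above. On the warm query, all but one of the 15410 tokens are cache hits.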

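Tips:

If you want to run the query-twice.py script more than once, you need to restart the vLLM LMCache server or change the context prefix you pass in, because LMCache has already been warmed.
The max model length here was chosen for a run on an L4 with only 23 GB of GPU memory. If you have more memory, you can increase the max model length and modify query-twice.py to use more of the long context. The TTFT improvement from LMCache becomes more pronounced as the context length grows!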
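When you raise the max model length, the character cutoff in query-twice.py can be found by search instead of trial and error. The following is a minimal sketch with hypothetical names (longest_prefix_within, count_tokens, budget). It assumes the token count never decreases as the prefix grows, which is a reasonable approximation for real tokenizers but not a guarantee, and the budget should leave headroom for the question, the chat template, and the generated answer.

```python
from typing import Callable


def longest_prefix_within(
    text: str, count_tokens: Callable[[str], int], budget: int
) -> int:
    """Binary-search the largest n with count_tokens(text[:n]) <= budget."""
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always progresses
        if count_tokens(text[:mid]) <= budget:
            lo = mid       # the prefix fits; try a longer one
        else:
            hi = mid - 1   # too many tokens; try a shorter one
    return lo


def word_count(s: str) -> int:
    """Stand-in tokenizer for illustration: one token per word."""
    return len(s.split())


text = "one two three four five"
cutoff = longest_prefix_within(text, word_count, budget=3)
print(repr(text[:cutoff]))
```

In query-twice.py, count_tokens could be lambda s: len(tokenizer.encode(s)), with the result replacing the hard-coded 70000 in the truncation line.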