Example: Offloading KV Cache to CPU

Warning

This page documents the behavior of LMCache's in-process mode, which is deprecated. Consider using LMCache MP mode for better feature support and performance.

This example shows how to offload the KV cache to CPU memory.

Note

Besides CPU memory, LMCache can offload the KV cache to many other destinations. For details, see Supported offloading destinations.

Prerequisites

Before you begin, make sure you have:

vLLM v1 with LMCache installed (see Installation)
A GPU that can run an LLM

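Using CPU offloading in offline inference

This section shows how to use CPU memory offloading with LMCache and vLLM in an offline inference scenario. The example script used here can be found in vLLM examples. See the examples README for how to run the script for vLLM v1.

First, set the environment variables that LMCache requires: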

import os

# Set token chunk size to 256
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
# Enable CPU memory backend
os.environ["LMCACHE_LOCAL_CPU"] = "True"
# Set CPU memory limit to 5GB
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"
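
The three settings above can also be grouped into a small helper so that they are validated and applied together before vLLM is initialized. The following is a minimal sketch: `lmcache_env` and `apply_env` are hypothetical helper names, not part of LMCache, and only the three variable names shown above are taken from this page.

```python
import os


def lmcache_env(chunk_size: int = 256, max_cpu_gb: float = 5.0) -> dict[str, str]:
    """Build the LMCache environment settings used in this example.

    Hypothetical helper: only the variable names come from this page.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if max_cpu_gb <= 0:
        raise ValueError("max_cpu_gb must be positive")
    return {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),          # tokens per chunk
        "LMCACHE_LOCAL_CPU": "True",                    # enable the CPU backend
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(max_cpu_gb),  # CPU memory limit
    }


def apply_env(settings: dict[str, str]) -> None:
    """Copy the settings into os.environ; environment values must be strings."""
    os.environ.update(settings)


apply_env(lmcache_env())
```

Keeping the values in one dictionary makes it easy to log or override them in one place; they still have to be set before the `LLM` object is created.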

Next, configure vLLM to integrate with LMCache:

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Configure KV cache transfer to use LMCache
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",
    kv_role="kv_both",
)

# Initialize LLM with LMCache configuration
# Adjust gpu_memory_utilization based on your GPU memory
llm = LLM(model="Qwen/Qwen3-8B",
          kv_transfer_config=ktc,
          max_model_len=8000,
          gpu_memory_utilization=0.8)

You can now run inference with automatic KV cache offloading:

# Create example prompts with shared prefix
shared_prompt = "Hello, how are you?" * 1000
prompts = [
    shared_prompt + "Hello, my name is",
]

# Define sampling parameters
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

# Run inference
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text!r}")

After inference completes, clean up the LMCache backend:

from lmcache.v1.cache_engine import LMCacheEngineBuilder
from lmcache.integration.vllm.utils import ENGINE_NAME

LMCacheEngineBuilder.destroy(ENGINE_NAME)
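
If inference raises an exception, the cleanup call above is skipped. One way to guard against that is a small context manager that always runs the cleanup. The sketch below is an assumption about how you might structure your own script, not an LMCache API: `cleanup_on_exit` is a hypothetical name, and a list's `append` method stands in for `LMCacheEngineBuilder.destroy` so the example runs without LMCache installed.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def cleanup_on_exit(destroy: Callable[[str], None], engine_name: str) -> Iterator[None]:
    """Call destroy(engine_name) when the block exits, even if it raises."""
    try:
        yield
    finally:
        destroy(engine_name)


# Stand-in for LMCacheEngineBuilder.destroy so the sketch runs without LMCache.
destroyed: list[str] = []

try:
    with cleanup_on_exit(destroyed.append, "example-engine"):
        raise RuntimeError("simulated inference failure")
except RuntimeError:
    pass

print(destroyed)  # ['example-engine']
```

With LMCache installed, the same pattern would wrap the `llm.generate` call as `with cleanup_on_exit(LMCacheEngineBuilder.destroy, ENGINE_NAME): ...`.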

During inference, LMCache automatically stores and manages the KV cache in CPU memory. You can monitor this through the logs, which show messages similar to the following:

LMCache INFO: Storing KV cache for 6006 out of 6006 tokens for request 0

This indicates that the KV cache has been successfully offloaded to CPU memory.

Note

Adjust gpu_memory_utilization based on your GPU's available memory.
The CPU offloading buffer size can be adjusted through LMCACHE_MAX_LOCAL_CPU_SIZE.
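
To choose a buffer size, it helps to estimate how much KV cache one token occupies. The sketch below uses the standard per-token estimate (one key and one value vector per layer and per KV head). The Qwen3-8B shape values are assumptions to be checked against the model's config.json, the buffer size is read as GiB, and metadata overhead is ignored, so treat the result as a rough upper bound.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen3-8B shape (check the model's config.json): 36 layers,
# 8 KV heads, head_dim 128, bf16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128,
                               dtype_bytes=2)

GIB = 1024 ** 3
buffer_gib = 5.0   # LMCACHE_MAX_LOCAL_CPU_SIZE from this example, read as GiB
chunk_size = 256   # LMCACHE_CHUNK_SIZE from this example

tokens_that_fit = int(buffer_gib * GIB) // per_token
chunks_that_fit = tokens_that_fit // chunk_size

print(f"{per_token} bytes per token")                        # 147456
print(f"{per_token * 10_000 / 1e9:.2f} GB per 10k tokens")   # 1.47
print(f"about {tokens_that_fit} tokens ({chunks_that_fit} chunks) fit")  # 36408, 142
```

Under these assumptions, 10,000 tokens take roughly 1.5 GB, which agrees with the "1.5 GB per 10k tokens" heuristic used by the benchmark script later on this page.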

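Using CPU offloading in online inference

This section shows how to use CPU memory offloading in an online serving scenario.

First, create a configuration file named lmcache_config.yaml with the following content: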

chunk_size: 256
local_cpu: true
max_local_cpu_size: 5

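Note

LMCache supports extensive configuration through the lmcache_config.yaml file, where you can customize the chunk size, memory limits, storage backends, and more. Advanced configuration options are covered in later examples. For now, let's run a minimal example with the configuration above.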
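
The three YAML keys used on this page mirror the environment variables from the offline section: each key is upper-cased and prefixed with `LMCACHE_`. The sketch below relies only on that observed pattern for these three keys; it is not a documented guarantee for every option. `parse_flat_yaml` and `to_env_names` are hypothetical helpers, and the parser handles flat `key: value` lines only.

```python
CONFIG_TEXT = """\
chunk_size: 256
local_cpu: true
max_local_cpu_size: 5
"""


def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines; nested YAML is out of scope for this sketch."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def to_env_names(settings: dict[str, str]) -> dict[str, str]:
    """Map each key to the LMCACHE_-prefixed, upper-cased name seen on this page."""
    return {f"LMCACHE_{key.upper()}": value for key, value in settings.items()}


for name, value in to_env_names(parse_flat_yaml(CONFIG_TEXT)).items():
    print(f"{name}={value}")
```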

Start the vLLM server with LMCache integration, passing the configuration file through an environment variable. Here is an example command:

LMCACHE_CONFIG_FILE=lmcache_config.yaml \
vllm serve \
    Qwen/Qwen3-8B \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1",
      "kv_role":"kv_both"
    }'

Key parameters:

LMCACHE_CONFIG_FILE: Path to the LMCache configuration file.
--kv-transfer-config: Configures the LMCache integration.
- kv_connector: Specifies the LMCache connector.
- kv_role: Set to "kv_both" to both store and load the KV cache.
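
Hand-quoting JSON inside a shell command is easy to get wrong. As a minimal sketch, the value for `--kv-transfer-config` can be generated with Python's standard library and quoted for a POSIX shell. The two keys and their values come from the command above; everything else is plain `json` and `shlex`.

```python
import json
import shlex

kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",
    "kv_role": "kv_both",
}

# Compact JSON on one line, then quote each argument for a POSIX shell.
config_json = json.dumps(kv_transfer_config, separators=(",", ":"))
command = [
    "vllm", "serve", "Qwen/Qwen3-8B",
    "--kv-transfer-config", config_json,
]
print("LMCACHE_CONFIG_FILE=lmcache_config.yaml " + shlex.join(command))
```

The printed line can be pasted into a shell; `shlex.join` wraps the JSON in single quotes so the double quotes inside it survive.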

Once the server is running, you can send requests to it with curl. The following example sends a request to the vLLM server with LMCache integration:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n",
    "max_tokens": 100,
    "temperature": 0.7
  }'
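
The same request can be sent from Python with only the standard library. This is a sketch that assumes the server from the command above is listening on localhost:8000; `build_request` and `send` are hypothetical helper names. Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

PROMPT = (
    "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def build_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": "Qwen/Qwen3-8B",
        "prompt": PROMPT,
        "max_tokens": 100,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request()
print(request.get_method(), request.full_url)
```

Calling `send(request)` twice against the running server corresponds to sending the curl command twice.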

You should see a log line like the following:

LMCache INFO: Storing KV cache for 31 out of 31 tokens for request cmpl-274bcaa80837444dbf9fbba4155d2620-0 (vllm_v1_adapter.py:497:lmcache.integration.vllm.vllm_v1_adapter)

When you send the same curl request again, you should see a log line like the following:

LMCache INFO: Reqid: cmpl-4ddf8863a6ac4dc3b6a952f2a107e9b2-0, Total tokens 31, LMCache hit tokens: 30, need to load: 14 (vllm_v1_adapter.py:543:lmcache.integration.vllm.vllm_v1_adapter)
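
To track cache hits across many requests, the hit line can be parsed with a regular expression. The sketch below assumes the log format stays exactly as in the sample line above, which may differ between LMCache versions; `parse_hit_line` is a hypothetical helper.

```python
import re
from typing import Optional

HIT_PATTERN = re.compile(
    r"Reqid: (?P<request_id>[^,\s]+), "
    r"Total tokens (?P<total>\d+), "
    r"LMCache hit tokens: (?P<hit>\d+), "
    r"need to load: (?P<load>\d+)"
)

LOG_LINE = (
    "LMCache INFO: Reqid: cmpl-4ddf8863a6ac4dc3b6a952f2a107e9b2-0, "
    "Total tokens 31, LMCache hit tokens: 30, need to load: 14 "
    "(vllm_v1_adapter.py:543:lmcache.integration.vllm.vllm_v1_adapter)"
)


def parse_hit_line(line: str) -> Optional[dict]:
    """Extract the request id and token counts, or None if the line does not match."""
    match = HIT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "total": int(match["total"]),
        "hit": int(match["hit"]),
        "load": int(match["load"]),
    }


stats = parse_hit_line(LOG_LINE)
print(f"{stats['hit']}/{stats['total']} prompt tokens hit in LMCache")
```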

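Example: Benefits of CPU offloading

This section demonstrates the performance benefit of CPU offloading with LMCache. We use a script that generates multiple prompts and compare performance with and without LMCache.

Prerequisites (setup)

A CUDA GPU. The example automatically picks a model that fits the GPU:
- Qwen/Qwen3-8B (bf16) when GPU memory is about 36 GiB or more (for example, A100-80G, H100).
- Qwen/Qwen3-8B-FP8 with kv_cache_dtype="fp8" when the GPU has about 24 GiB and native FP8 support (Ada Lovelace / Hopper, sm_89+; for example, L4, L40, RTX 4090). The script's actual threshold for this tier is 20 GiB.
- Qwen/Qwen3-1.7B as the fallback for smaller cards (about 10 GiB and up), including 24 GiB Ampere cards without FP8 support (RTX A5000, RTX 3090).
Enough CPU RAM. The example clamps the LMCache pinned host buffer to fit your system RAM and RLIMIT_MEMLOCK (ulimit -l), so it also runs on small hosts without manual tuning.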

Example script

Save the following script as cpu-offloading.py:

# SPDX-License-Identifier: Apache-2.0
"""
This file demonstrates the example usage of cpu offloading
with LMCache in vLLM v1.

Note that lmcache needs to be installed to run this example.
Learn more about LMCache in https://github.com/LMCache/LMCache.
"""
import os
import torch
import argparse
import time
from lmcache.v1.cache_engine import LMCacheEngineBuilder
from lmcache.integration.vllm.utils import ENGINE_NAME
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

def parse_arguments() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="CPU offloading example with LMCache")
    parser.add_argument("--num-prompts", type=int, default=10,
                      help="Number of prompts to generate (default: 10)")
    parser.add_argument("--num-tokens", type=int, default=10000,
                      help="Number of tokens per prompt (default: 10000)")
    parser.add_argument("--enable-lmcache", action="store_true",
                      help="Enable LMCache for CPU offloading (default: False)")
    return parser.parse_args()

def pick_cpu_size_gb(workload_gb: float) -> float:
    """
    Clamp the LMCache pinned host buffer to fit system RAM and RLIMIT_MEMLOCK.

    cudaHostAlloc pins pages, so the buffer cannot exceed total RAM nor the
    per-process memlock limit (`ulimit -l`). On hosts where either is small,
    the original "1.5 GB per 10k tokens" formula fails with cudaErrorMemoryAllocation.

    Args:
        workload_gb: Desired buffer size for the workload, in GiB.
    Returns:
        float: A buffer size in GiB that fits both caps, never below 1.0.
    """
    import psutil

    ram_gib = psutil.virtual_memory().total / (1024 ** 3)
    try:
        import resource
        memlock_soft, _ = resource.getrlimit(resource.RLIMIT_MEMLOCK)
        memlock_gib = (
            float("inf")
            if memlock_soft == resource.RLIM_INFINITY
            else memlock_soft / (1024 ** 3)
        )
    except ImportError:
        # `resource` is POSIX-only; on Windows treat memlock as unbounded.
        memlock_gib = float("inf")
    return max(min(workload_gb, ram_gib * 0.5, memlock_gib * 0.9), 1.0)

def setup_lmcache_environment(num_prompts: int, num_tokens: int) -> None:
    """
    Configure LMCache environment variables.
    Args:
        num_prompts: Number of prompts to process
        num_tokens: Number of tokens per prompt
    """
    workload_gb = num_prompts * num_tokens * 1.5 / 10000  # 1.5 GB per 10k tokens
    cpu_size = pick_cpu_size_gb(workload_gb)

    env_vars = {
        "LMCACHE_CHUNK_SIZE": "256",         # Set tokens per chunk
        "LMCACHE_LOCAL_CPU": "True",         # Enable local CPU backend
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(cpu_size)  # CPU memory limit (GB)
    }
    for key, value in env_vars.items():
        os.environ[key] = value

def pick_model_and_kwargs() -> tuple[str, dict]:
    """
    Pick a Qwen model that fits the current GPU's memory and compute capability.

    Tiers:
        - >= 36 GiB                    -> Qwen/Qwen3-8B (bf16)
        - >= 20 GiB and sm >= 89       -> Qwen/Qwen3-8B-FP8 (native FP8)
        - >= 10 GiB                    -> Qwen/Qwen3-1.7B
        - otherwise                    -> RuntimeError

    Returns:
        tuple[str, dict]: (model id, extra kwargs to pass to ``LLM``).
    Raises:
        RuntimeError: If no CUDA GPU is visible or it is too small.
    """
    if not torch.cuda.is_available():
        raise RuntimeError("No GPU available")

    total_gib = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    major, minor = torch.cuda.get_device_capability(0)
    sm = major * 10 + minor
    has_fp8 = sm >= 89  # Ada Lovelace / Hopper

    if total_gib >= 36:
        return "Qwen/Qwen3-8B", {}
    if total_gib >= 20 and has_fp8:
        print(f"[fallback] GPU {total_gib:.1f} GiB sm_{sm}: using Qwen3-8B-FP8")
        return "Qwen/Qwen3-8B-FP8", {"kv_cache_dtype": "fp8"}
    if total_gib >= 10:
        print(f"[fallback] GPU {total_gib:.1f} GiB sm_{sm}: using Qwen3-1.7B")
        return "Qwen/Qwen3-1.7B", {}
    raise RuntimeError(
        f"GPU has {total_gib:.1f} GiB; need at least 10 GiB for Qwen3-1.7B"
    )

def create_test_prompts(num_prompts: int = 10, num_tokens: int = 1000) -> list[str]:
    """
    Create test prompts with index prefix and dummy body.
    Args:
        num_prompts: Number of prompts to generate
        num_tokens: Approximate number of tokens per prompt (using 'Hi ' as token unit)
    Returns:
        list: List of prompts with format '[index] Hi Hi Hi...'
    """
    prompts = []
    dummy_text = "Hi " * num_tokens

    for i in range(num_prompts):
        prompt = f"[Prompt {i}] {dummy_text} how are you?"
        prompts.append(prompt)

    return prompts

def initialize_llm(max_len: int = 16384, enable_lmcache: bool = True) -> LLM:
    """
    Initialize the LLM with a model auto-selected for the current GPU.
    Args:
        max_len: Maximum sequence length
        enable_lmcache: Whether to wire up the LMCache KV connector
    Returns:
        LLM: Configured LLM instance
    """
    model_name, extra_kwargs = pick_model_and_kwargs()

    ktc = KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ) if enable_lmcache else None

    return LLM(
        model=model_name,
        kv_transfer_config=ktc,
        max_model_len=max_len,
        enable_prefix_caching=False,
        gpu_memory_utilization=0.9,
        **extra_kwargs,
    )

def generate_and_print_output(
    llm: LLM,
    prompts: list[str],
    sampling_params: SamplingParams,
) -> float:
    """
    Generate text and print the results.
    Args:
        llm: LLM instance
        prompts: List of input prompts
        sampling_params: Configured sampling parameters
    Returns:
        float: Time taken for generation in seconds
    """
    start_time = time.time()
    outputs = llm.generate(prompts, sampling_params)
    end_time = time.time()

    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")

    return end_time - start_time

def main() -> None:
    """Main execution function."""
    # Parse command line arguments
    args = parse_arguments()

    # Setup environment if LMCache is enabled
    if args.enable_lmcache:
        setup_lmcache_environment(args.num_prompts, args.num_tokens)

    # Create prompts and sampling parameters
    prompts = create_test_prompts(num_prompts=args.num_prompts, num_tokens=args.num_tokens)
    sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=1)

    # Initialize model
    llm = initialize_llm(enable_lmcache=args.enable_lmcache)

    # First run
    print("\nFirst run:")
    first_run_time = generate_and_print_output(llm, prompts, sampling_params)
    print(f"First run time: {first_run_time:.2f} seconds")

    # Second run
    print("\nSecond run:")
    second_run_time = generate_and_print_output(llm, prompts, sampling_params)
    print(f"Second run time: {second_run_time:.2f} seconds")

    # Print speedup (guard the divisor, which is the second run's time)
    if second_run_time > 0:
        speedup = first_run_time / second_run_time
        print(f"\nSpeedup (first run / second run): {speedup:.2f}x")

    # Cleanup if LMCache was enabled
    if args.enable_lmcache:
        LMCacheEngineBuilder.destroy(ENGINE_NAME)

if __name__ == "__main__":
    main()
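
To see how the buffer clamp in `pick_cpu_size_gb` behaves on different hosts, the same formula can be evaluated with the host limits passed in explicitly. This is a sketch: `clamp_cpu_size_gib` is a hypothetical pure-function variant of the script's helper, and the host sizes are made-up inputs.

```python
def clamp_cpu_size_gib(workload_gib: float, ram_gib: float, memlock_gib: float) -> float:
    """Same clamp as pick_cpu_size_gb, with the host limits passed in explicitly."""
    return max(min(workload_gib, ram_gib * 0.5, memlock_gib * 0.9), 1.0)


# Default workload in the script: 10 prompts x 10000 tokens x 1.5 GB per 10k tokens.
workload = 10 * 10000 * 1.5 / 10000  # 15.0

cases = {
    "large host, unlimited memlock": (64.0, float("inf")),
    "16 GiB host, unlimited memlock": (16.0, float("inf")),
    "large host, 8 GiB memlock": (64.0, 8.0),
    "large host, 64 MiB memlock": (64.0, 64 / 1024),
}
for label, (ram, memlock) in cases.items():
    print(f"{label}: {clamp_cpu_size_gib(workload, ram, memlock):.2f} GiB")
```

The four cases yield 15.00, 8.00, 7.20, and 1.00 GiB. In the last case the 1.0 GiB floor is larger than the 64 MiB memlock limit, so, going by the docstring's description of pinned allocation, the buffer may still fail to allocate until `ulimit -l` is raised.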

Running the example

First, run the script without LMCache:
```
python cpu-offloading.py
```
You will see output similar to:
```
Speedup (first run / second run): 1.00x
```
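Without LMCache, there is no speedup between runs. The script initializes vLLM with enable_prefix_caching=False, so the second run recomputes the KV cache from scratch; even with vLLM's prefix caching enabled, a KV cache that exceeds GPU memory cannot be kept for reuse.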

Now, run the script with LMCache enabled:

python cpu-offloading.py --enable-lmcache

You will see output similar to:

Speedup (first run / second run): 7.43x

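The significant speedup in the second case shows how LMCache manages KV cache offloading to CPU memory. When the total KV cache size exceeds GPU memory, LMCache lets you store the cache in CPU memory and reuse it from there, so subsequent generations for prompts with shared prefixes are faster.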

Supported offloading destinations

LMCache currently supports offloading the KV cache to the following destinations:

CPU memory
Local file system
Mooncake store
InfiniStore
Redis
ValKey

Troubleshooting

If you encounter the following error:

(EngineCore_DP0 pid=55437) ERROR 10-04 14:44:47 [core.py:708] RuntimeError:
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

You can resolve it in either of the following ways:

Set VLLM_WORKER_MULTIPROC_METHOD=spawn in the environment.
Or update your Python code so that the vLLM usage is inside an if __name__ == '__main__': block.

if __name__ == '__main__':
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig
    from lmcache.v1.cache_engine import LMCacheEngineBuilder
    from lmcache.integration.vllm.utils import ENGINE_NAME
    main()
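
The two remedies can also be combined in one script. The sketch below sets the variable from Python instead of the shell; it assumes vLLM reads `VLLM_WORKER_MULTIPROC_METHOD` when it starts its worker processes, so the assignment is placed before any vLLM import to be safe. The body of `main` is a placeholder.

```python
import os

# Must run before vLLM creates its worker processes; placing it at the top of
# the script, ahead of the vLLM import, is the conservative choice.
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")


def main() -> None:
    # Placeholder: import vLLM and LMCache here, then build the LLM and run inference.
    print("start method:", os.environ["VLLM_WORKER_MULTIPROC_METHOD"])


if __name__ == "__main__":
    main()
```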

For details, see the vLLM troubleshooting guide: Python multiprocessing.