多模态模型的 KV Cache 与 vLLM#

概述#

vLLM 正在构建其多模态能力,目前支持以下 多模态语言模型列表

因此,LMCache 可用于加速 vLLM 支持的所有多模态模型的推理时间。本文档展示了在 vLLM 中使用 LMCache 进行 KV 缓存的加速改进。

不同多模态类型的 TTFT 加速示例#

前提条件#

  • 一台至少配备一块 GPU 的机器。您可以根据您的显存调整 vLLM 实例的最大模型长度。

  • vLLM 和 LMCache 已安装 (安装指南)

  • vLLM 音频依赖项已安装:pip install vllm[audio]

示例#

备注

下面的示例使用 Python 脚本对托管在 vLLM 上的多模态模型进行推理。该脚本是 vLLM 中的 openai_chat_completion_client_for_multimodal python 脚本。您需要将其下载到本地以运行下面的示例。该脚本在后面的 参考部分 中打印供您查阅。请转到 示例输出 部分查看 vLLM 日志中的输出,以展示加速改进。

使用 Ultravox 进行音频推理:#

使用 fixie-ai/ultravox-v0_5-llama-3_2-1b 模型和 LMCache KV 缓存启动 vLLM 服务器:

vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b \
    --max-model-len 4096 --trust-remote-code \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

运行 Python 脚本两次,以演示由于缓存导致的第二次推理的 TTFT 加速:

# run twice to see TTFT speedup
python openai_chat_completion_client_for_multimodal.py --chat-type audio
使用 Llava 进行单图像推理:#

使用 llava-hf/llava-1.5-7b-hf 模型和 LMCache KV 缓存启动 vLLM 服务器:

vllm serve llava-hf/llava-1.5-7b-hf \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

运行 Python 脚本两次,以演示由于缓存导致的第二次推理的 TTFT 加速:

# run twice to see TTFT speedup
python openai_chat_completion_client_for_multimodal.py --chat-type single-image
使用 Phi-3.5-vision-instruct 进行多图像推理:#

使用 microsoft/Phi-3.5-vision-instruct 模型和 LMCache KV 缓存启动 vLLM 服务器:

vllm serve microsoft/Phi-3.5-vision-instruct \
    --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}' \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

运行 Python 脚本两次,以演示由于缓存导致的第二次推理的 TTFT 加速:

# run twice to see TTFT speedup
python openai_chat_completion_client_for_multimodal.py --chat-type multi-image
使用 Llava-OneVision 进行视频推理:#

使用 llava-hf/llava-onevision-qwen2-7b-ov-hf 模型和 LMCache KV 缓存启动 vLLM 服务器:

vllm serve llava-hf/llava-onevision-qwen2-7b-ov-hf \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'

运行 Python 脚本两次,以演示由于缓存导致的第二次推理的 TTFT 加速:

# run twice to see TTFT speedup
python openai_chat_completion_client_for_multimodal.py --chat-type video

示例输出#

在运行上述示例时,您会注意到 vLLM 日志中类似于以下的输出。

该第一个输出演示了令牌被缓存和加载的过程。

[2025-08-04 22:43:35,484] LMCache INFO: Reqid: chatcmpl-05e2d296601046b29210f53a1fa30b13, Total tokens 1536, LMCache hit tokens: 1536, need to load: 1535 (vllm_v1_adapter.py:803:lmcache.integration.vllm.vllm_v1_adapter)

这显示了第一次和第二次运行之间的加速。

  1. 第一次请求:

Chat completion output from input audio: It seems like you're excitedly sharing your thoughts and predictions about a game you're about to watch. The audio appears to be a stream of comments or a social media post. The words "one pitch on the way to Edgar Martinez" suggest that someone is saying something in a baseball chat or social media post about the
Chat completion output from audio url: It appears to be a enthusiastic and excited baseball comment from an individual. The content seems to be a play-by-play description of a specific baseball game, with the narrator belonging to a team that is competing in the American League Championship Series. The reference to the player Edgar Martinez is a nod to a well-known baseball player,
Chat completion output from base64 encoded audio: It seems like you're excited about a baseball game, possibly the Los Angeles Dodgers or the Boston Red Sox, but it's unclear which one. The text mentions a "pitcher" and "swung on the line," but it's not entirely obvious which team it's referring to.

However, the mention of "
Time taken: 50.828808307647705 seconds
  1. 第二个请求:

Chat completion output from input audio: It seems like you're extremely excited about the possibility of the San Francisco Giants winning the American League championship and playing in the World Series. The audio is filled with emotions and a sense of optimism, with you enthusiastically expressing your thoughts and feelings. It's clear that this is a significant moment for you, particularly given the fact
Chat completion output from audio url: I can tell you're excited about a baseball game. It seems like you're reliving a moment during the middle of a game, especially the highlight of a six runs game for the Golden Giants. The audio appears to include a local sports radio talk show style broadcast, with a ringer ("the guy" in the
Chat completion output from base64 encoded audio: It seems like you're having a lively discussion about a Major League Baseball game, specifically about the shortstop playing for the Mariners and,Mario Upton swinging at a pitch and eventually being thrown out on a play at the plate. The atmosphere is excited, with all the cheering and commentary you've written. It appears to
Time taken: 3.3407371044158936 seconds

参考:在 vLLM 示例 Python 脚本中推理多模态模型#

源: https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_client_for_multimodal.py

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import base64

import requests
from openai import OpenAI

from vllm.utils import FlexibleArgumentParser

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from openai import APIConnectionError, OpenAI
from openai.pagination import SyncPage
from openai.types.model import Model

import time


def get_first_model(client: OpenAI) -> str:
    """
    Get the first model from the vLLM server.
    """
    try:
        models: SyncPage[Model] = client.models.list()
    except APIConnectionError as e:
        raise RuntimeError(
            "Failed to get the list of models from the vLLM server at "
            f"{client.base_url} with API key {client.api_key}. Check\n"
            "1. the server is running\n"
            "2. the server URL is correct\n"
            "3. the API key is correct"
        ) from e

    if len(models.data) == 0:
        raise RuntimeError(f"No models found on the vLLM server at {client.base_url}")

    return models.data[0].id

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)


def encode_base64_content_from_url(content_url: str) -> str:
    """Encode a content retrieved from a remote url to base64 format."""

    with requests.get(content_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode("utf-8")

    return result


# Text-only inference
def run_text_only(model: str) -> None:
    chat_completion = client.chat.completions.create(
        messages=[{"role": "user", "content": "What's the capital of France?"}],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion.choices[0].message.content
    print("Chat completion output:", result)


# Single-image input inference
def run_single_image(model: str) -> None:
    ## Use image url in the payload
    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
    chat_completion_from_url = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url},
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output from image url:", result)

    ## Use base64 encoded image in the payload
    image_base64 = encode_base64_content_from_url(image_url)
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"},
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from base64 encoded image:", result)


# Multi-image input inference
def run_multi_image(model: str) -> None:
    image_url_duck = "https://upload.wikimedia.org/wikipedia/commons/d/da/2015_Kaczka_krzy%C5%BCowka_w_wodzie_%28samiec%29.jpg"
    image_url_lion = "https://upload.wikimedia.org/wikipedia/commons/7/77/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg"
    chat_completion_from_url = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What are the animals in these images?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url_duck},
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url_lion},
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output:", result)


# Video input inference
def run_video(model: str) -> None:
    video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
    video_base64 = encode_base64_content_from_url(video_url)

    ## Use video url in the payload
    chat_completion_from_url = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this video?"},
                    {
                        "type": "video_url",
                        "video_url": {"url": video_url},
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output from image url:", result)

    ## Use base64 encoded video in the payload
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this video?"},
                    {
                        "type": "video_url",
                        "video_url": {"url": f"data:video/mp4;base64,{video_base64}"},
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from base64 encoded image:", result)


# Audio input inference
def run_audio(model: str) -> None:
    from vllm.assets.audio import AudioAsset

    audio_url = AudioAsset("winning_call").url
    audio_base64 = encode_base64_content_from_url(audio_url)

    # OpenAI-compatible schema (`input_audio`)
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this audio?"},
                    {
                        "type": "input_audio",
                        "input_audio": {
                            # Any format supported by librosa is supported
                            "data": audio_base64,
                            "format": "wav",
                        },
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from input audio:", result)

    # HTTP URL
    chat_completion_from_url = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this audio?"},
                    {
                        "type": "audio_url",
                        "audio_url": {
                            # Any format supported by librosa is supported
                            "url": audio_url
                        },
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_url.choices[0].message.content
    print("Chat completion output from audio url:", result)

    # base64 URL
    chat_completion_from_base64 = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this audio?"},
                    {
                        "type": "audio_url",
                        "audio_url": {
                            # Any format supported by librosa is supported
                            "url": f"data:audio/ogg;base64,{audio_base64}"
                        },
                    },
                ],
            }
        ],
        model=model,
        max_completion_tokens=64,
    )

    result = chat_completion_from_base64.choices[0].message.content
    print("Chat completion output from base64 encoded audio:", result)


example_function_map = {
    "text-only": run_text_only,
    "single-image": run_single_image,
    "multi-image": run_multi_image,
    "video": run_video,
    "audio": run_audio,
}


def parse_args():
    parser = FlexibleArgumentParser(
        description="Demo on using OpenAI client for online serving with "
        "multimodal language models served with vLLM."
    )
    parser.add_argument(
        "--chat-type",
        "-c",
        type=str,
        default="single-image",
        choices=list(example_function_map.keys()),
        help="Conversation type with multimodal data.",
    )
    return parser.parse_args()


def main(args) -> None:
    chat_type = args.chat_type
    model = get_first_model(client)
    example_function_map[chat_type](model)


if __name__ == "__main__":
    args = parse_args()
    start_time = time.time()
    main(args)
    end_time = time.time()
    print(f"Time taken: {end_time - start_time} seconds")