# Kubernetes Deployment
To deploy vLLM with LMCache on Kubernetes, we recommend the vLLM Production Stack project: a production-ready reference implementation for K8s-native, cluster-wide deployment of vLLM and LMCache.
To get started, follow the quickstart guide in the official documentation and replace the Helm values file with the following (values-05-cpu-offloading.yaml):
```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "mistral"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "mistralai/Mistral-7B-Instruct-v0.2"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "40Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      maxModelLen: 32000
    lmcacheConfig:
      enabled: true
      cpuOffloadingBufferSize: "20"
    hf_token: <hf-token>
```
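Here `cpuOffloadingBufferSize: "20"` reserves a 20 GB CPU-memory buffer for offloaded KV cache. To deploy with this values file, a minimal sketch follows, assuming the chart is published under the project's Helm repository at https://vllm-project.github.io/production-stack and that the values file above is saved locally as values-05-cpu-offloading.yaml; consult the official quickstart for the authoritative commands:

```bash
# Add the vLLM Production Stack Helm repository and refresh the chart index.
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update

# Install the stack using the CPU-offloading values file shown above.
helm install vllm vllm/vllm-stack -f values-05-cpu-offloading.yaml
```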
Alternatively, refer to the detailed step-by-step tutorial on offloading KV cache with LMCache in the production stack.
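Once the release is up, you can sanity-check the deployment along these lines. This is a sketch, not part of the official tutorial: the service name below (`vllm-router-service`) is an assumption that depends on your release name, so adjust it to match `kubectl get svc` output:

```bash
# Wait until the serving engine and router pods are Running.
kubectl get pods

# Forward the router's OpenAI-compatible endpoint to localhost
# (service name assumed; verify with `kubectl get svc`).
kubectl port-forward svc/vllm-router-service 30080:80

# Confirm the model is being served.
curl http://localhost:30080/v1/models
```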