(Experimental) LMCache Multi-process Mode#

LMCache multi-process mode lets you run LMCache as a standalone service in a separate process from vLLM.

At runtime, vLLM instances establish a connection to the LMCache process and send KV cache requests to it.

In the future, each GPU node will run a single LMCache process in multi-process mode, and these LMCache processes will be interconnected to form a distributed KV cache service.

Note

This is an experimental feature and is under active development. Please expect breaking changes in the future.

Note

Currently, the multi-process mode only supports CPU offloading without eviction. It is not recommended for production use.

Prerequisites#

  • vLLM version >= 0.11.1

  • LMCache latest dev branch
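
One way to satisfy these prerequisites in a pip-based environment is sketched below. The repository URL and source layout are assumptions; defer to the LMCache repository's own installation instructions if they differ.

# Install or upgrade vLLM (multi-process mode requires >= 0.11.1)
pip install -U "vllm>=0.11.1"

# Install LMCache from source to get the latest development code
git clone https://github.com/LMCache/LMCache.git
cd LMCache
pip install -e .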

Quick Start#

Step 1: Start the LMCache server

Run the following command to start the LMCache server with a 100 GB CPU buffer:

python3 -m lmcache.v1.multiprocess.server --cpu-buffer-size 100

You should see the following log output:

[2025-11-19 21:20:58,901] LMCache INFO: LMCache cache server is running... (server.py:483:__main__)

Note

The default port for LMCache is 5555. It will accept connections from vLLM instances on this port.
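
Before starting vLLM, you can optionally confirm that the server is reachable. The check below is a sketch that assumes nc (netcat) is installed; any TCP port check works.

nc -z localhost 5555 && echo "LMCache server is listening on port 5555"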

Step 2: Start vLLM with LMCacheMPConnector

In a new terminal window, start vLLM with the LMCache connector:

vllm serve Qwen/Qwen3-14B \
    --kv-transfer-config '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

You should see the following logs on the vLLM side:

(EngineCore_DP0 pid=3086423) [2025-11-19 23:10:25,072] LMCache INFO: Registering kv caches! (lmcache_mp_connector.py:405:vllm.distributed.kv_transfer.kv_connector.v1.lmcache_mp_connector)
(EngineCore_DP0 pid=3086423) [2025-11-19 23:10:25,072] LMCache INFO: Registering kv caches (multi_process_adapter.py:205:vllm.distributed.kv_transfer.kv_connector.v1.lmcache_integration.multi_process_adapter)

You should also see the following logs on the LMCache side:

[2025-11-19 23:10:25,084] LMCache INFO: Registered KV cache for GPU ID 3086423 with 40 layers (server.py:215:__main__)

Step 3: Send requests to the vLLM instance

Send a test request whose prompt repeats the same sentence 100 times, so it is long enough to exercise the cache:

curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d "{
    \"model\": \"Qwen/Qwen3-14B\",
    \"prompt\": \"$(printf 'Explain the significance of KV cache in language models.%.0s' {1..100})\",
    \"max_tokens\": 10
}"

On the first request, you should see the following logs on the LMCache side, indicating that tokens were stored in the cache:

[2025-11-19 23:24:39,547] LMCache INFO: Stored 768 tokens in 0.001 seconds (server.py:299:__main__)

If you send the same request again, you should see the following logs, indicating that tokens were retrieved from the cache:

[2025-11-19 23:24:47,312] LMCache INFO: Retrieved 768 tokens in 0.001 seconds (server.py:370:__main__)
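
You can also observe the cache hit from the client side by timing the two requests. The loop below is a sketch using curl's built-in timing; the absolute numbers depend on the model and hardware, but the second request should complete faster because the prefill KV cache is reused from LMCache.

PROMPT=$(printf 'Explain the significance of KV cache in language models.%.0s' {1..100})
for i in 1 2; do
  # -w prints curl's total request time; the response body is discarded
  curl -s -o /dev/null -w "request $i total time: %{time_total}s\n" \
    -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"Qwen/Qwen3-14B\", \"prompt\": \"$PROMPT\", \"max_tokens\": 10}"
done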

Detailed Configuration#

Server Configuration#

The LMCache multi-process server supports the following command-line arguments:

  • --host: Host address to bind the server (default: localhost)

  • --port: Port number to bind the server (default: 5555)

  • --chunk-size: Chunk size for KV cache operations in tokens (default: 256)

  • --cpu-buffer-size: CPU buffer size in GB for caching (default: 5.0)

  • --max-workers: Maximum number of worker threads for handling requests (default: 1)

Example with custom configuration:

python3 -m lmcache.v1.multiprocess.server \
    --host 0.0.0.0 \
    --port 6000 \
    --chunk-size 512 \
    --cpu-buffer-size 50.0 \
    --max-workers 4

vLLM Client Configuration#

On the vLLM side, you can specify the host and port of the LMCache server through the kv_connector_extra_config parameter:

vllm serve Qwen/Qwen3-14B \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both", "kv_connector_extra_config": {"lmcache.mp.host": "127.0.0.1", "lmcache.mp.port": 6000}}'

Future Work#

  • Thread-safe memory allocator and storage manager.

  • Eviction policy.

  • Plug in the existing storage backends.

  • Potential performance improvements (double buffering, new kernels, etc.).

  • Lock and unlock semantics in the new storage manager.

  • Distributed mode with sharding.