RESP (Native Redis/Valkey)#
Overview#
The RESP backend is a high-performance native C++ storage connector for Redis and Valkey servers, using the RESP2 wire protocol over TCP. It is designed for maximum throughput on KV cache store and retrieval operations, achieving 6+ GB/s on reads with optimal configuration.
Key advantages over the standard Redis connector:
Multi-threaded C++ I/O: Worker threads operate in parallel with zero-copy buffer passing and full GIL release
Batched tiling: Large batch operations are automatically split across worker threads for maximum parallelism
eventfd-based completions: The kernel wakes Python on completion – no polling overhead
Dual-mode support: The same C++ connector works in both non-MP mode (via ConnectorClientBase) and MP mode (via NativeConnectorL2Adapter as an L2 adapter)
The native C++ source lives in csrc/storage_backends/redis/. See Adding Native Connectors for the full architecture.
Prerequisites#
LMCache installed from source (pip install -e .) to compile the C++ extension
A Redis 8.2+ or Valkey server (Redis 8.2 recommended for IO threads support)
A machine with at least one GPU for vLLM inference
Redis Server Setup#
Important
Redis version and server configuration have a major impact on throughput. Using Redis 8.2 with IO threads yields ~6 GB/s reads vs ~1.5 GB/s with Redis 6.0 defaults.
Build Redis 8.2 from source (recommended):
git clone https://github.com/redis/redis.git
cd redis
git checkout 8.2
make -j
Start the server with IO threads enabled:
./src/redis-server \
--protected-mode no \
--save '' \
--appendonly no \
--io-threads 4 \
--port 6379
| Flag | Why |
|---|---|
| --protected-mode no | Allow connections from other hosts (use auth in production) |
| --save '' --appendonly no | Disable persistence – KV cache is ephemeral, persistence wastes bandwidth |
| --io-threads 4 | Enable multi-threaded I/O for parallel read/write handling |
| --port 6379 | Default port (adjust if running multiple instances) |
Tip
The number of --io-threads should roughly match the number of physical cores
available to the Redis process. 4 is a good starting point; benchmark with your
hardware to find the optimum.
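If you want a quick starting value for the tip above, a small sketch like the following derives an --io-threads count from the host's physical core count. It is purely illustrative and uses the third-party psutil package, which is not an LMCache dependency:

```python
# Hypothetical starting-point calculation for --io-threads (see tip above).
# Requires the third-party "psutil" package: pip install psutil
import psutil

physical_cores = psutil.cpu_count(logical=False) or 1
io_threads = max(2, min(physical_cores, 8))  # clamp to a reasonable range; benchmark to confirm
print(f"./src/redis-server --io-threads {io_threads} ...")
```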
Chunk Size Selection and Throughput Tuning#
The chunk size (in tokens) determines how many bytes each Redis key-value pair holds. This is the single most important parameter for throughput.
The sweet spot is ~4 MB per chunk. Both smaller and larger chunks degrade throughput:
| Chunk Size | Total Data | SET Throughput | GET Throughput |
|---|---|---|---|
| 1 MB (500 keys) | 500 MB | ~3.5 GB/s | ~5.2 GB/s |
| 4 MB (500 keys) | 2 GB | ~4.4 GB/s | ~5.9 GB/s |
| 8 MB (200 keys) | 1.6 GB | ~4.2 GB/s | ~1.4 GB/s |
Why 4 MB?
Below ~2 MB, per-key overhead (RESP framing, TCP round-trips) dominates
Above ~4 MB, Redis server-side memory allocation and TCP window sizes become bottlenecks
At 4 MB, the balance between amortized overhead and memory pressure is optimal
Calculating chunk size in tokens:
The chunk size in bytes depends on the model's number of KV heads, head dimension, number of layers, and dtype:
bytes_per_token = 2 * num_kv_heads * head_dim * num_layers * dtype_bytes
For meta-llama/Llama-3.1-8B-Instruct with BFloat16:
bytes_per_token = 2 * 8 * 128 * 32 * 2 = 131,072 bytes (~128 KB)
chunk_size_tokens = 4 MB / 128 KB = 32 tokens
# But typically chunk_size is set as token count in config:
chunk_size: 16 # ~2 MB per chunk (conservative)
chunk_size: 32 # ~4 MB per chunk (optimal for throughput)
Note
The bytes-per-token calculation varies by model architecture. Larger models (e.g., 70B) have more layers and larger hidden dimensions, so fewer tokens are needed per chunk to reach the 4 MB sweet spot.
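As a rough sketch, the same calculation can be wrapped in a small helper that picks a chunk_size (in tokens) near the 4 MB target. The function name is illustrative; in practice, read the parameters from your model's config:

```python
def tokens_per_chunk(num_kv_heads: int, head_dim: int, num_layers: int,
                     dtype_bytes: int = 2, target_bytes: int = 4 * 1024 * 1024) -> int:
    """Estimate the chunk_size (in tokens) that lands near the ~4 MB sweet spot."""
    bytes_per_token = 2 * num_kv_heads * head_dim * num_layers * dtype_bytes
    return max(1, target_bytes // bytes_per_token)

# meta-llama/Llama-3.1-8B-Instruct with BFloat16: 8 KV heads, head_dim 128, 32 layers
print(tokens_per_chunk(num_kv_heads=8, head_dim=128, num_layers=32))  # -> 32
```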
Throughput Sweep#
To find the optimal configuration for your hardware, use the included benchmark:
cd examples/kv_cache_reuse/remote_backends/resp
# Sweep chunk sizes
for mb in 0.5 1 2 4 8; do
echo "=== Chunk: ${mb} MB ==="
python benchmark_resp_client.py \
--host 127.0.0.1 --port 6379 \
--chunk-mb $mb --num-workers 8 --num-keys 500
done
# Sweep worker counts
for w in 1 2 4 8 16; do
echo "=== Workers: $w ==="
python benchmark_resp_client.py \
--host 127.0.0.1 --port 6379 \
--chunk-mb 4 --num-workers $w --num-keys 500
done
Expected output:
Redis RESP Client Benchmark
Server: 127.0.0.1:6379, Workers: 8
Chunk size: 4096KB, Keys: 500
------------------------------------------------------------
Batch SET: 4.36 GB/s (1.95 GB written)
Batch GET: 5.91 GB/s (1.95 GB read)
Batch EXISTS: 143528 ops/s (500/500 hits)
------------------------------------------------------------
All tests passed
Environment Variable Configuration#
Sensitive credentials (and optionally host/port) can be provided via environment variables instead of config files or CLI arguments. This prevents secrets from appearing in logged configuration at startup.
| Variable | Description |
|---|---|
| LMCACHE_RESP_USERNAME | Redis ACL username. Used as default when no username is set in the config or adapter JSON. |
| LMCACHE_RESP_PASSWORD | Redis AUTH password. Used as default when no password is set in the config or adapter JSON. |
| LMCACHE_RESP_HOST | Redis hostname or IP. Used as default when no host is set in the config or adapter JSON. |
| LMCACHE_RESP_PORT | Redis port. Used as default when no port is set in the config or adapter JSON. |
Config files (non-MP) and --l2-adapter JSON (MP) take precedence over
environment variables. Environment variables serve as defaults — they are used
when the corresponding config value is empty or unset. They are read at adapter
creation time inside the adapter itself, so they are never stored in the
config object and never printed in startup logs.
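A minimal sketch of that fallback order (the helper below is hypothetical; the actual adapter reads the variables internally, but the precedence matches the description above):

```python
import os

def resolve_setting(config_value: str | None, env_var: str) -> str:
    """Explicit config / --l2-adapter JSON wins; the env var is only a default."""
    if config_value:                       # value supplied in YAML or adapter JSON
        return config_value
    return os.environ.get(env_var, "")     # empty string means "not set / no auth"

username = resolve_setting(None, "LMCACHE_RESP_USERNAME")
password = resolve_setting(None, "LMCACHE_RESP_PASSWORD")
```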
Example — MP mode with env vars:
export LMCACHE_RESP_USERNAME="default"
export LMCACHE_RESP_PASSWORD="secret"
lmcache server \
--l1-size-gb 10 \
--eviction-policy LRU \
--chunk-size 16 \
--l2-adapter '{"type": "resp", "host": "localhost", "port": 6379, "num_workers": 8}' \
--port 6555
Example — Non-MP mode with env vars:
export LMCACHE_RESP_USERNAME="default"
export LMCACHE_RESP_PASSWORD="secret"
LMCACHE_CONFIG_FILE=resp-config.yaml \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--no-enable-prefix-caching \
--load-format dummy
Tip
For production deployments, always use environment variables for credentials rather than embedding them in config files or CLI arguments.
Non-MP Mode (Single Process)#
In non-MP mode, the RESP connector is used directly as a remote storage backend
via the RESPClient asyncio wrapper.
Configuration file (resp-config.yaml):
chunk_size: 16
remote_url: "resp://localhost:6379"
remote_serde: "naive"
Credentials can be set via environment variables (recommended) or in the config file
under extra_config (see Environment Variable Configuration above).
Launch vLLM:
LMCACHE_CONFIG_FILE=resp-config.yaml \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
--no-enable-prefix-caching \
--load-format dummy
Note
save_unfull_chunk must be off (default) and chunk metadata saving must be
disabled for optimal throughput with the native RESP connector.
MP Mode (Multiprocess)#
In MP mode, LMCache runs as a separate server process communicating with vLLM over ZMQ. The RESP connector serves as an L2 adapter with variable-size chunk support.
Step 1: Start Redis (see Redis Server Setup above)
Step 2: Launch LMCache MP Server:
lmcache server \
--l1-size-gb 10 \
--eviction-policy LRU \
--chunk-size 16 \
--l2-adapter '{"type": "resp", "host": "localhost", "port": 6379, "num_workers": 8}' \
--port 6555
Step 3: Launch vLLM with LMCache MP Connector:
PORT=8000
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{
"kv_connector": "LMCacheMPConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"lmcache.mp.host": "tcp://localhost",
"lmcache.mp.port": 6555
}
}' \
--no-enable-prefix-caching \
--port $PORT \
--load-format dummy
L2 Adapter Configuration#
The --l2-adapter JSON accepts these fields:
| Field | Type | Default | Description |
|---|---|---|---|
| type | str | (required) | Must be "resp" |
| host | str | (required) | Redis/Valkey hostname or IP |
| port | int | (required) | Redis/Valkey port |
| num_workers | int | 8 | C++ worker threads for parallel I/O |
| username | str | "" (empty) | Redis ACL username (leave empty for no auth). Falls back to LMCACHE_RESP_USERNAME. |
| password | str | "" (empty) | Redis AUTH password (leave empty for no auth). Falls back to LMCACHE_RESP_PASSWORD. |
| max_capacity_gb | float | 0 | Maximum L2 storage capacity in GB for client-side usage tracking. Required for L2 eviction. Set to 0 (default) to disable usage tracking. |
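If hand-quoting the inline JSON becomes error-prone, one option is to build the --l2-adapter argument from a Python dict. This is just a convenience sketch, not something LMCache requires:

```python
import json

# Same settings as the MP example above; username/password are omitted so the
# LMCACHE_RESP_* environment variables are used as the fallback.
adapter = {
    "type": "resp",
    "host": "localhost",
    "port": 6379,
    "num_workers": 8,
}
print(f"--l2-adapter '{json.dumps(adapter)}'")
```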
L2 Eviction#
To enable automatic eviction of least-recently-used keys when the Redis backend fills up,
set max_capacity_gb and add an "eviction" block:
lmcache server \
--l1-size-gb 10 \
--eviction-policy LRU \
--chunk-size 16 \
--l2-adapter '{
"type": "resp",
"host": "localhost",
"port": 6379,
"num_workers": 8,
"max_capacity_gb": 10,
"eviction": {
"eviction_policy": "LRU",
"trigger_watermark": 0.8,
"eviction_ratio": 0.2
}
}' \
--port 6555
This configures a 10 GB capacity limit. When usage exceeds 80% (trigger_watermark),
the eviction controller will delete the least-recently-used ~20% of stored keys
(eviction_ratio) using the Redis DEL command.
Note
max_capacity_gb enables client-side size tracking. It does not configure
the Redis server’s maxmemory setting. You should set max_capacity_gb to
match or be slightly below your Redis server’s available memory.
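For intuition, the numbers in the example above work out roughly as follows. This is illustrative arithmetic only, not the eviction controller's actual code:

```python
# Illustrative arithmetic for the eviction configuration above.
max_capacity_gb = 10
trigger_watermark = 0.8      # eviction starts once tracked usage exceeds 8 GB
eviction_ratio = 0.2         # each pass DELetes roughly 20% of the LRU keys

usage_gb = 8.5               # hypothetical current client-side usage estimate
if usage_gb > max_capacity_gb * trigger_watermark:
    print(f"usage {usage_gb} GB > {max_capacity_gb * trigger_watermark:.0f} GB watermark: "
          f"evict ~{eviction_ratio:.0%} of stored keys")
```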
Testing the Setup#
Send the same prompt twice. The first request stores KV cache to Redis; the second retrieves it.
PORT=8000
PROMPT="$(printf 'Elaborate the significance of KV cache in language models. %.0s' {1..1000})"
# First request: store
curl -s -X POST http://localhost:${PORT}/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"'"$PROMPT"'","max_tokens":10}'
# Second request with same prefix: retrieve from Redis
curl -s -X POST http://localhost:${PORT}/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"'"$PROMPT"'","max_tokens":10}'
Verify data was stored:
redis-cli -p 6379 DBSIZE
Clear state between runs:
redis-cli -p 6379 FLUSHALL
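The same checks can be scripted with the redis-py client if you prefer Python (assumes pip install redis; the stored values are opaque connector chunks):

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
print("stored keys:", r.dbsize())

# Peek at one stored chunk and its server-side memory footprint
key = r.randomkey()
if key is not None:
    print("sample key:", key, "approx bytes:", r.memory_usage(key))

# Clear state between runs (equivalent to redis-cli FLUSHALL)
# r.flushall()
```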
Best Practices#
Server deployment:
Use Redis 8.2+ with --io-threads 4 (or more, matching available cores)
Disable persistence (--save '' --appendonly no) for KV cache workloads
Pin Redis to its own NUMA node if running on multi-socket systems
For production, enable authentication with --requirepass and supply credentials via LMCACHE_RESP_USERNAME / LMCACHE_RESP_PASSWORD environment variables to keep them out of logs
Client tuning:
Start with num_workers: 8 and increase if the server has spare CPU and you're not saturating the network
More workers help when chunk sizes are smaller (more keys per batch = more parallelism needed)
On NUMA systems, ensure the LMCache process runs on the same NUMA node as the NIC
Chunk size:
Target ~4 MB per chunk for maximum throughput
Calculate the token count using your model’s per-token byte size (see formula above)
If unsure, run the benchmark sweep to find the optimum for your specific hardware
Network:
Use localhost or loopback for single-machine deployments
For cross-machine setups, ensure low-latency networking (ideally <100 us RTT; see the quick check after this list)
The RESP connector uses TCP; RDMA is not currently supported (consider Mooncake for RDMA)
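To sanity-check the RTT guidance above from the client host, a quick PING loop with the redis-py client is usually enough (illustrative script; assumes pip install redis):

```python
import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)
samples = []
for _ in range(200):
    start = time.perf_counter()
    r.ping()
    samples.append(time.perf_counter() - start)

median_us = sorted(samples)[len(samples) // 2] * 1e6
print(f"median PING round-trip: {median_us:.0f} us")
```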
Additional Resources#
Benchmark script: examples/kv_cache_reuse/remote_backends/resp/benchmark_resp_client.py
C++ source: csrc/storage_backends/redis/
Native connector architecture: csrc/storage_backends/README.md
Developer guide for adding new native connectors: Adding Native Connectors