Maru#

Overview#

Maru is a high-performance KV cache storage engine built on CXL shared memory, designed for LLM inference scenarios where multiple instances need to share a KV cache with minimal latency.

For architecture details, see the Maru documentation.

Quick Start#

Install Maru:

git clone https://github.com/xcena-dev/maru.git
cd maru
./install.sh

This installs maru-server, maru-resourced, and the maru Python package.

Deploy Model With Maru#

Prerequisites: CXL device (/dev/dax*), Python 3.12+, vLLM and LMCache installed.

1. Start the Maru Server

maru-server

2. Create configuration file (maru-config.yaml):

chunk_size: 256
local_cpu: False
max_local_cpu_size: 0
save_unfull_chunk: True

# Maru backend
maru_path: "maru://localhost:5555"
maru_pool_size: 4

3. Start vLLM with Maru

LMCACHE_CONFIG_FILE="maru-config.yaml" \
vllm serve \
    meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 65536 \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

Configuration#

LMCache Parameters:

Parameter	Default	Description
`maru_path`	Required	Maru server URL (format: `maru://host:port`)
`maru_pool_size`	`4.0`	CXL memory pool size per instance in GB (e.g., `4`, `0.5`)

Advanced Parameters (via extra_config):

Parameter	Default	Description
`maru_instance_id`	auto UUID	Unique client instance identifier
`maru_timeout_ms`	5000	ZMQ RPC socket timeout in milliseconds
`maru_use_async_rpc`	true	Async DEALER-ROUTER RPC (`false` for synchronous REQ-REP)
`maru_max_inflight`	64	Max concurrent async RPC requests
`maru_eager_map`	true	Pre-map all shared regions on connect

Maru#

Overview#

Quick Start#

Deploy Model With Maru#

Configuration#

Additional Resources#