LMCache Engine
The LMCache Engine has two main functions: store and retrieve.
The store function:
- Breaks input tokens and KV caches into manageable chunks.
- Stores the chunks in a dictionary managed by the LMCache Backend.
- Uses prefix token hashes (currently SHA-256) to index the chunks, combined with other arguments (e.g., format) to form the key (see the sketch after this list).
- Efficiently stores the KV caches while avoiding redundancy (if skip_existing=True). The store operation can be blocking or non-blocking, depending on the blocking argument.
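For intuition, here is a minimal, self-contained sketch of the chunk-and-hash idea. It is not LMCache's actual implementation: the chunk size, the token serialization, and the exact way the hash is chained over the prefix are assumptions for illustration.

```python
import hashlib
from typing import Iterable, List, Tuple

def chunk_prefix_hashes(
    tokens: List[int], chunk_size: int = 256
) -> Iterable[Tuple[str, List[int]]]:
    """Yield (prefix_hash, chunk) pairs for a token sequence.

    Illustrative only: the serialization and hash chaining here are
    assumptions, not LMCache's actual scheme.
    """
    prefix = hashlib.sha256()
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start : start + chunk_size]
        # Fold this chunk into the running hash so that a chunk's key
        # depends on every token before it (a "prefix token hash").
        prefix.update(" ".join(map(str, chunk)).encode())
        yield prefix.hexdigest(), chunk
```

Because each chunk's key commits to the entire token prefix, two requests that share a prompt prefix map to the same keys and can reuse each other's cached chunks.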
The retrieve function:
- Retrieves KV caches for the input tokens, using the same chunking and hashing mechanism as store.
- Supports partial retrieval via the mask parameter, allowing retrieval of suffixes or specific portions of the token sequence.
- Concatenates the retrieved KV cache chunks into a usable format for model inference (see the end-to-end sketch after this list).
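The following hypothetical end-to-end sketch shows the two functions together. It assumes engine is an already-constructed LMCacheEngine (the config and metadata setup is omitted) and that kv_tuples is the model's KV cache as nested tuples ((K, V) per layer, with no batch dimension).

```python
import torch

tokens = torch.randint(0, 32000, (1024,))  # stand-in token ids

# Store the KV cache; chunks already held by the backend are skipped.
engine.store(tokens, kv_tuples, skip_existing=True, blocking=True)

# Later, retrieve whatever prefix of these tokens is cached.
kv, ret_mask = engine.retrieve(tokens, return_tuple=True)
print(f"retrieved KV for {int(ret_mask.sum())} of {len(tokens)} tokens")
```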
The details of the LMCacheEngine class are listed below.
- class lmcache.cache_engine.LMCacheEngine(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata)
- _blob_to_tuple_kv(blob: Tensor) → Tuple[Tuple[Tensor, Tensor], ...]
Convert a single big tensor to the nested tuple of KV tensors.
- _chunk_kv(kv_tensors: Tensor, fmt: str) → Iterable[Tensor]
Chunk the KV cache into chunks of size self.chunk_size.
- Parameters:
kv_tensors – the KV cache of the tokens, as a single tensor
fmt – either 'huggingface' or 'vllm'
- Returns:
a generator of KV cache chunks, each covering at most self.chunk_size tokens
- _chunk_tokens(tokens: Tensor) → Iterable[Tensor]
Chunk the tokens into chunks of size self.chunk_size.
- Parameters:
tokens – the input tokens, with shape [seq_len]
- Returns:
a generator of chunks of tokens, each with shape [chunk_size]
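One plausible implementation of this chunking with PyTorch (a sketch, not necessarily the actual code):

```python
import torch
from typing import Iterable

def chunk_tokens(tokens: torch.Tensor, chunk_size: int) -> Iterable[torch.Tensor]:
    # torch.split yields views of `tokens` of length chunk_size
    # (the last chunk may be shorter), matching the generator
    # behavior described above.
    yield from torch.split(tokens, chunk_size)
```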
- _make_chunks(tokens: Tensor, kv_tensors: Tensor, fmt: str, num_skip_prefix_chunk=0, skip_existing=True) → Iterable[Tuple[str, Tensor]]
Returns a generator of zipped (chunk_hash, chunk_kv) tuples.
- _make_chunks_skip_existing(tokens: Tensor, kv_tensors: Tensor, fmt: str, num_skip_prefix_chunk=0) → Iterable[Tuple[str, Tensor]]
Skips chunks that already exist in the cache and returns the remaining chunks.
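The skip-existing behavior can be pictured as a filter over the (chunk_hash, chunk_kv) stream. This sketch uses a hypothetical backend_contains callback in place of the real LMCache Backend lookup:

```python
from typing import Callable, Iterable, Tuple
import torch

def skip_existing_chunks(
    chunks: Iterable[Tuple[str, torch.Tensor]],
    backend_contains: Callable[[str], bool],  # hypothetical backend probe
) -> Iterable[Tuple[str, torch.Tensor]]:
    for chunk_hash, chunk_kv in chunks:
        if backend_contains(chunk_hash):
            continue  # chunk already cached: avoid redundant storage
        yield chunk_hash, chunk_kv
```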
- _make_key(chunk_hash: str, fmt: str) → CacheEngineKey
Builds the CacheEngineKey for a chunk from its hash and the KV format.
- _slice_kv_at(start_idx: int, kv_tensors: Tensor, fmt: str) → List[Tensor]
Slices the KV tensors along the token dimension, starting at start_idx.
vllm format: [num_layer, 2, num_tokens, num_kv_head, head_size]; huggingface format: [num_layer, 2, num_kv_head, num_tokens, head_size]
- _tuple_kv_to_blob(kv_tensors: Tuple[Tuple[Tensor, Tensor], ...]) → Tensor
Convert the nested tuple of KV tensors to a single big tensor with 2 extra dimensions.
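A sketch of what this blob round-trip could look like, assuming all layers' K and V tensors share one shape (not necessarily the actual implementation):

```python
import torch
from typing import Tuple

KVTuples = Tuple[Tuple[torch.Tensor, torch.Tensor], ...]

def tuple_kv_to_blob(kv: KVTuples) -> torch.Tensor:
    # Stacking K/V adds one dimension (size 2); stacking layers adds
    # another (size num_layers): the blob is [num_layers, 2, ...].
    return torch.stack([torch.stack(layer_kv) for layer_kv in kv])

def blob_to_tuple_kv(blob: torch.Tensor) -> KVTuples:
    # Inverse: peel the two leading dimensions back off.
    return tuple((layer[0], layer[1]) for layer in blob)
```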
- lookup(tokens: Tensor) → int
Checks how many prefix tokens of the input already have KV cache in the cache engine.
- Parameters:
tokens – the input tokens, with shape [seq_len]
- Returns:
An int indicating how many prefix tokens are cached.
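Hypothetical usage, with engine construction omitted:

```python
# How many leading tokens of this prompt already have cached KV?
num_cached = engine.lookup(tokens)  # tokens: 1-D tensor of shape [seq_len]
print(f"{num_cached} prefix tokens are cached")
```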
- retrieve(tokens: Tensor, mask: Tensor | None = None, return_tuple: bool = True) → Tuple[Tuple[Tuple[Tensor, Tensor], ...] | Tensor, Tensor]
Retrieve the KV cache of the tokens from the cache engine. The retrieved KV cache is a prefix of the input tokens.
- Parameters:
tokens – the input tokens, with shape [seq_len]
mask – a boolean mask over the tokens indicating which tokens' KV cache should be retrieved. Currently, only suffix masks are supported.
return_tuple – whether to return the KV cache as nested tuples or as a single tensor
- Returns:
A tuple (kv_tensors, ret_mask). kv_tensors is the KV cache of the tokens, either as nested tuples or as a single tensor with shape [num_layers, 2, hidden_dim, num_tokens] (huggingface) or [num_layers, 2, num_tokens, hidden_dim] (vllm); it will be an empty tuple if no KV cache is retrieved, regardless of return_tuple. ret_mask is a boolean mask indicating which tokens' KV cache was retrieved.
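For instance, a suffix mask selecting the last 100 tokens could be built like this (a sketch; engine construction omitted):

```python
import torch

mask = torch.zeros(tokens.shape[0], dtype=torch.bool)
mask[-100:] = True  # only suffix masks are currently supported
kv, ret_mask = engine.retrieve(tokens, mask=mask, return_tuple=True)
num_hit = int(ret_mask.sum())  # tokens whose KV cache was actually found
```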
- store(tokens: Tensor, kv_tensors_raw: Tuple[Tuple[Tensor, Tensor], ...], kv_tensors_mask: Tensor | None = None, skip_existing=True, blocking=True) → None
Store the KV cache of the tokens into the cache engine. The format is either 'huggingface' or 'vllm':
For huggingface, each layer's K/V tensor should have the shape [num_heads, num_tokens, head_size]
For vllm, each layer's K/V tensor should have the shape [num_tokens, num_heads, head_size]
- Parameters:
tokens – the input tokens, with shape [seq_len]
kv_tensors_raw – the KV cache of the tokens, in the format of nested tuples. If kv_tensors_mask is not None, the number of tokens in kv_tensors_raw should equal the number of True values in the mask; otherwise, it should equal the number of input tokens.
kv_tensors_mask – a boolean mask indicating which tokens' KV cache should be stored. Only suffix masks are supported; None is treated as all tokens True. len(kv_tensors_mask) should equal len(tokens), and the number of True values should equal the number of tokens in kv_tensors_raw.
skip_existing – whether to skip chunks that already exist in the cache
blocking – whether to wait for the store operation to finish
- Returns:
None
Note
The KV cache should NOT have the “batch” dimension.
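Finally, a hedged sketch of a masked, non-blocking store, e.g. after appending newly decoded tokens to an already-cached prefix. Here engine, tokens, and kv_suffix (nested-tuple KV covering only the masked tokens, without a batch dimension) are assumed to exist:

```python
import torch

new_tokens = 128  # assumed number of freshly decoded tokens
mask = torch.zeros(tokens.shape[0], dtype=torch.bool)
mask[-new_tokens:] = True  # suffix mask: store KV only for the new tokens
# kv_suffix must cover exactly as many tokens as there are Trues in mask.
engine.store(tokens, kv_suffix, kv_tensors_mask=mask,
             skip_existing=True, blocking=False)
```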