LMCache Blend#

Submodules#

lmcache.blend.executor module#

class lmcache.blend.executor.CacheBlendImpl(recompute_ratio: float)[source]#

Bases: BlendExecutor

blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

This function blends the retrieved KV with the fresh KV, and returns the short Q, the blended (long) KV, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K layer, in shape [num_tokens, hidden_dims]

  • retrieved_v (torch.Tensor) – The retrieved V layer, in shape [num_tokens, hidden_dims]

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split, in shape [num_tokens, hidden_dims]

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1. Note this will NOT be changed after token selection.

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions
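
The sketch below shows how the arguments above fit together for a single layer with dummy tensors. The shapes, the positional-encoder callback signature (positions, q, k) -> (q, k), the mask dtype, and token_dim=0 are assumptions for illustration, not guarantees of the API:

    # Hedged sketch only: shapes, the encoder callback signature, and token_dim
    # below are assumptions for illustration, not guarantees of the LMCache API.
    import torch

    from lmcache.blend.executor import CacheBlendImpl

    num_tokens, hidden_dims = 8, 128
    executor = CacheBlendImpl(recompute_ratio=0.15)

    # Assumed callback signature: (positions, q, k) -> (q, k). A real setup
    # would register the model's rotary-embedding function here instead.
    executor.set_positional_encoder(lambda positions, q, k: (q, k))
    executor.set_reverse_positional_encoder(lambda positions, q, k: (q, k))

    output = executor.blend(
        layer_id=0,
        retrieved_k=torch.randn(num_tokens, hidden_dims),
        retrieved_v=torch.randn(num_tokens, hidden_dims),
        valid_mask=torch.ones(num_tokens, dtype=torch.long),  # CPU mask, 1 = valid
        original_positions=torch.arange(num_tokens),
        fresh_q=torch.randn(num_tokens, hidden_dims),
        fresh_k=torch.randn(num_tokens, hidden_dims),
        fresh_v=torch.randn(num_tokens, hidden_dims),
        positions=torch.arange(num_tokens),
        query_start_loc=torch.tensor([0, num_tokens]),  # one request in the batch
        token_dim=0,  # assumption: tokens along dim 0
    )
    # output.q keeps only the selected tokens, so it may be shorter than
    # output.k / output.v, which cover all num_tokens positions.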

set_positional_encoder(positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
set_reverse_positional_encoder(reverse_positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
lmcache.blend.executor.create_index(ndims, target_dim, index)[source]#
lmcache.blend.executor.indices_to_mask(indices, size)[source]#
lmcache.blend.executor.mask_to_indices(mask)[source]#
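
These module-level helpers are small indexing utilities. The sketch below shows what they plausibly compute, inferred only from their names and the blend signature; the actual implementations in lmcache.blend.executor may differ:

    # Conceptual equivalents only -- not the actual lmcache implementations.
    import torch

    def mask_to_indices_sketch(mask: torch.Tensor) -> torch.Tensor:
        # Positions where the mask is nonzero, as a 1-D index tensor.
        return torch.nonzero(mask, as_tuple=False).flatten()

    def indices_to_mask_sketch(indices: torch.Tensor, size: int) -> torch.Tensor:
        # Boolean mask of length `size` that is True at the given indices.
        mask = torch.zeros(size, dtype=torch.bool, device=indices.device)
        mask[indices] = True
        return mask

    def create_index_sketch(ndims: int, target_dim: int, index) -> tuple:
        # Indexing tuple that selects `index` along `target_dim` and keeps all
        # other dimensions, e.g. tensor[create_index_sketch(3, 1, ids)].
        return tuple(index if d == target_dim else slice(None) for d in range(ndims))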

lmcache.blend.interfaces module#

class lmcache.blend.interfaces.BlendExecutor[source]#

The interface for the cacheblend executor to blend the retrieved KV with fresh KVs

abstract blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

This function blends the retrieved KV with the fresh KV, and returns the short Q, the blended (long) KV, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K tensor

  • retrieved_v (torch.Tensor) – The retrieved V tensor

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions
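
Concrete executors subclass BlendExecutor and implement blend(). The skeleton below is only a sketch of what such a subclass looks like; its pass-through body selects every token and performs no real CacheBlend logic:

    # Skeleton sketch of a custom executor; the pass-through body is for
    # illustration only and does not implement real CacheBlend selection.
    import torch

    from lmcache.blend.interfaces import BlendExecutor, BlendOutput

    class PassThroughBlendExecutor(BlendExecutor):
        def blend(self, layer_id, retrieved_k, retrieved_v, valid_mask,
                  original_positions, fresh_q, fresh_k, fresh_v, positions,
                  query_start_loc, token_dim) -> BlendOutput:
            # Keep the fresh Q/K/V unchanged and select every token,
            # so nothing is skipped and nothing is blended.
            num_tokens = fresh_q.shape[token_dim]
            return BlendOutput(
                q=fresh_q,
                k=fresh_k,
                v=fresh_v,
                positions=positions,
                local_indices=torch.arange(num_tokens, device=fresh_q.device),
                query_start_loc=None,  # None: no token selection happened
            )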

class lmcache.blend.interfaces.BlendOutput(q: Tensor, k: Tensor, v: Tensor, positions: Tensor, local_indices: Tensor, query_start_loc: Tensor | None)[source]#

The output of the cacheblend module

Variables:
  • q (torch.Tensor) – The short Q tensor with selected tokens

  • k (torch.Tensor) – The long K tensor with the updated values

  • v (torch.Tensor) – The long V tensor with the updated values

  • positions (torch.Tensor) – The positions of the selected Q tokens in the input sequence

  • local_indices (torch.Tensor) – The positions of the selected Q tokens in fresh q

  • query_start_loc (Optional[torch.Tensor]) – The modified query_start_loc if token selection has happened. Will be None if no selection has happened.

k: Tensor#
local_indices: Tensor#
positions: Tensor#
q: Tensor#
query_start_loc: Tensor | None#
v: Tensor#
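
The fields relate as follows: q is the short tensor that still needs attention, while k and v already cover the full sequence. A minimal illustration of those relationships, ignoring multi-head layout and any real attention backend:

    # Illustration of the BlendOutput field relationships; not a real kernel.
    import torch

    def attend_blended(out) -> torch.Tensor:
        # One selected query token per entry in local_indices, so q can be
        # shorter than the blended k / v.
        assert out.q.shape[0] == out.local_indices.shape[0] <= out.k.shape[0]
        # out.positions say where each selected token sits in the input sequence.
        scores = out.q @ out.k.transpose(0, 1) / out.q.shape[-1] ** 0.5
        return scores.softmax(dim=-1) @ out.v  # shape: [num_selected, hidden_dims]
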
class lmcache.blend.interfaces.BlendRetriever[source]#

The interface for the cacheblend retriever to retrieve the KV caches

It takes the input tokens and ROI as input, launches the retrieval tasks (possibly asynchronous), and returns a BlendRetrieverTask that can be used to retrieve the KV caches.

abstract new_request(full_prompts: List[Tensor], indices: List[List[int]]) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:

  • full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch.

  • indices (List[List[int]]) – The indices of where the segmented requests start in the full prompts.

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask

class lmcache.blend.interfaces.BlendRetrieverResult(k: Tensor | None, v: Tensor | None, valid_mask: Tensor, original_positions: Tensor)[source]#

The result of the cacheblend retriever

Variables:
  • k (torch.Tensor) – The K tensor of a single layer, will be None if nothing is retrieved

  • v (torch.Tensor) – The V tensor of a single layer, will be None if nothing is retrieved

  • valid_mask (torch.Tensor) – The valid mask on CPU

  • original_positions (torch.Tensor) – The original positions of the retrieved KV in the input sequence. If the corresponding KV is not valid, the position will be 0. This tensor will be on the same device as K and V.

k: Tensor | None#
original_positions: Tensor#
v: Tensor | None#
valid_mask: Tensor#
class lmcache.blend.interfaces.BlendRetrieverTask[source]#

The KV retrieval task created by the BlendRetriever

abstract result(layer_id: int) BlendRetrieverResult[source]#

Blocking function to get a single layer of K and V tensors. The returned K and V tensors should match the length of the input tokens passed to the BlendRetriever.new_request function. If the KV of a token is not available, the valid_mask will be 0 and the corresponding values in the KV tensors are undefined.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult
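
A consumer typically drains the task one layer at a time, checking the mask before trusting the retrieved values. The loop below is a sketch; task stands for any BlendRetrieverTask and num_layers is a hypothetical layer count:

    # Sketch of draining a BlendRetrieverTask layer by layer.
    # `task` and `num_layers` are placeholders for illustration.
    def drain_retriever_task(task, num_layers: int):
        for layer_id in range(num_layers):
            result = task.result(layer_id)  # blocking per-layer call
            if result.k is None or result.v is None:
                # Nothing retrieved for this layer: every token must be
                # recomputed from the fresh QKV.
                continue
            # valid_mask lives on CPU; entries equal to 0 mark tokens whose
            # retrieved KV is undefined and must not be used. original_positions
            # is on the same device as k / v and is 0 for invalid tokens.
            yield layer_id, result.k, result.v, result.valid_mask, result.original_positions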

lmcache.blend.retriever module#

class lmcache.blend.retriever.SPTBlendRetriever(spt: List[int], cache_engine: LMCacheEngine, metadata: LMCacheEngineMetadata)[source]#

Bases: BlendRetriever

Implement the retrieval logic using a “SPecial Token” (SPT) as the delimiter.

This implementation assumes that there MUST be a special token at the end of the input text chunk.

Example

Input = [x, x, x, spt, y, y, spt, z, z, z, z]

Requests sent to LMCache engine when using drop_spt_and_get_indices and new_request:

  • [x, x, x]

  • [y, y]

  • [z, z, z, z]

Therefore, to use this retriever, the text chunks should also end with the special token.

drop_spt_and_get_indices(full_prompt: List[int]) Tuple[List[int], List[int]][source]#

Drop the special token and get the indices of the split requests.

Parameters:

full_prompt (List[int]) – The full prompt after tokenization.

Returns:

The new prompt without the special tokens and the indices of the split segments. The indices record the start of each segment, ending with the end of the full prompt, e.g. [0, index_of_segment2, len(full_prompt)]

new_request(full_prompts: List[Tensor], indices: List[List[int]]) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:

  • full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch, which will contain the tokens hitting vLLM’s internal prefix caching.

  • indices (List[List[int]]) – The indices of where the segmented requests start in the full prompts.

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask
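
Putting the pieces together, a caller strips the special tokens from the tokenized prompt, then issues new_request and reads out the layers. The sketch below assumes new_request receives the prompt returned by drop_spt_and_get_indices; the special-token id and the already-constructed cache_engine / metadata objects are placeholders:

    # Sketch only: the special-token id is hypothetical, and cache_engine /
    # metadata are assumed to be an already-constructed LMCacheEngine and its
    # LMCacheEngineMetadata (setup not shown).
    import torch

    from lmcache.blend.retriever import SPTBlendRetriever

    def retrieve_first_layer(cache_engine, metadata):
        retriever = SPTBlendRetriever(spt=[32001], cache_engine=cache_engine,
                                      metadata=metadata)

        # Tokenized prompt laid out as [x, x, x, spt, y, y, spt, z, z, z, z].
        full_prompt = [11, 12, 13, 32001, 21, 22, 32001, 31, 32, 33, 34]
        clean_prompt, indices = retriever.drop_spt_and_get_indices(full_prompt)

        # One entry per request in the batch; assumption: new_request takes the
        # prompt with the special tokens already dropped.
        task = retriever.new_request(full_prompts=[torch.tensor(clean_prompt)],
                                     indices=[indices])
        return task.result(layer_id=0)  # blocking; one layer of K/V plus masks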

class lmcache.blend.retriever.SPTBlendRetrieverTask(token_segments: List[Tensor], tasks: List[Future], fmt: str)[source]#

Bases: BlendRetrieverTask

result(layer_id: int) BlendRetrieverResult[source]#

Blocking function to get a single layer of K and V tensors. The returned K and V tensors should match the length of the input tokens passed to the BlendRetriever.new_request function.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult

Module contents#