LMCache Blend#

Submodules#

lmcache.blend.executor module#

class lmcache.blend.executor.CacheBlendImpl(recompute_ratio: float)[source]#

Bases: BlendExecutor

blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

Blends the retrieved KV with the fresh KV and returns the short Q (selected tokens only), the long blended K and V, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K layer, in shape [num_tokens, hidden_dims]

  • retrieved_v (torch.Tensor) – The retrieved V layer, in shape [num_tokens, hidden_dims]

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split, in shape [num_tokens, hidden_dims]

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1. Note this will NOT be changed after token selection.

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions
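
The "short Q + long KV" contract can be illustrated with a toy, list-based sketch. This is not the CacheBlendImpl algorithm: the selection rule used here (recompute every token whose retrieved KV is invalid) and the name `toy_blend` are assumptions for illustration only, and plain Python lists with one scalar per token stand in for tensors. V would be handled the same way as K.

```python
from typing import List, Tuple


def toy_blend(
    retrieved_k: List[float],
    valid_mask: List[int],
    fresh_q: List[float],
    fresh_k: List[float],
    positions: List[int],
) -> Tuple[List[float], List[float], List[int], List[int]]:
    """Hypothetical helper: blend one layer of K and shorten Q."""
    # Assumed selection rule: recompute tokens whose retrieved KV is invalid.
    local_indices = [i for i, ok in enumerate(valid_mask) if not ok]

    # Short Q: only the selected tokens keep their query.
    short_q = [fresh_q[i] for i in local_indices]

    # Long K: same length as the input; fresh values overwrite selected slots.
    long_k = list(retrieved_k)
    for i in local_indices:
        long_k[i] = fresh_k[i]

    # Positions of the selected Q tokens in the input sequence.
    short_positions = [positions[i] for i in local_indices]
    return short_q, long_k, short_positions, local_indices
```

With four tokens of which the second and fourth are invalid, Q shrinks to two entries while K keeps all four.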

set_positional_encoder(positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
set_reverse_positional_encoder(reverse_positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
lmcache.blend.executor.create_index(ndims, target_dim, index)[source]#
lmcache.blend.executor.indices_to_mask(indices, size)[source]#
lmcache.blend.executor.mask_to_indices(mask)[source]#
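
The reference gives no description for these helpers. Judging only from the names, `indices_to_mask` and `mask_to_indices` presumably convert between a list of indices and a boolean mask. A list-based sketch of such a conversion, under that assumption (the `toy_` functions are hypothetical and are not the library's implementation):

```python
from typing import List


def toy_indices_to_mask(indices: List[int], size: int) -> List[bool]:
    """Hypothetical: mark each listed index as True in a mask of length `size`."""
    mask = [False] * size
    for i in indices:
        mask[i] = True
    return mask


def toy_mask_to_indices(mask: List[bool]) -> List[int]:
    """Hypothetical: return the indices at which the mask is True."""
    return [i for i, flag in enumerate(mask) if flag]
```

The two functions are inverses of each other for sorted, duplicate-free index lists.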

lmcache.blend.interfaces module#

class lmcache.blend.interfaces.BlendExecutor[source]#

The interface for the cacheblend executor to blend the retrieved KV with fresh KVs

abstract blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

Blends the retrieved KV with the fresh KV and returns the short Q (selected tokens only), the long blended K and V, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K tensor

  • retrieved_v (torch.Tensor) – The retrieved V tensor

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions

class lmcache.blend.interfaces.BlendOutput(q: Tensor, k: Tensor, v: Tensor, positions: Tensor, local_indices: Tensor, query_start_loc: Tensor | None)[source]#

The output of the cacheblend module

Variables:
  • q (torch.Tensor) – The short Q tensor with selected tokens

  • k (torch.Tensor) – The long K tensor with the updated values

  • v (torch.Tensor) – The long V tensor with the updated values

  • positions (torch.Tensor) – The positions of the selected Q tokens in the input sequence

  • local_indices (torch.Tensor) – The positions of the selected Q tokens in fresh q

  • query_start_loc (Optional[torch.Tensor]) – The modified query_start_loc if token selection has happened. Will be None if no selection has happened.

k: Tensor#
local_indices: Tensor#
positions: Tensor#
q: Tensor#
query_start_loc: Tensor | None#
v: Tensor#
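
The relationship between `local_indices`, `positions`, and the modified `query_start_loc` can be sketched as follows. How LMCache computes these fields is not stated in this reference; the function names below are hypothetical, lists stand in for tensors, and `local_indices` is assumed to be sorted in ascending order.

```python
from bisect import bisect_left
from typing import List


def select_positions(positions: List[int], local_indices: List[int]) -> List[int]:
    """Positions in the input sequence of the tokens kept in the short Q."""
    return [positions[i] for i in local_indices]


def shrink_query_start_loc(
    query_start_loc: List[int], local_indices: List[int]
) -> List[int]:
    """Recompute request boundaries after token selection.

    Each new boundary is the number of selected tokens that lie before
    the original boundary. Assumes local_indices is sorted ascending.
    """
    return [bisect_left(local_indices, b) for b in query_start_loc]
```

For a batch of two requests with boundaries `[0, 4, 7]` and selected tokens `[1, 3, 5]`, the first request keeps two tokens and the second keeps one, so the new boundaries are `[0, 2, 3]`.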
class lmcache.blend.interfaces.BlendRetriever[source]#

The interface for the cacheblend retriever to retrieve the KV caches

It takes input tokens and ROI as input, launches retrieval tasks (possibly asynchronously), and returns a BlendRetrieverTask from which the KV caches can be retrieved.

abstract new_request(input_tokens: Tensor, query_start_loc: Tensor) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:
  • input_tokens (torch.Tensor) – The input tokens, could include multiple requests in a batch

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask
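
The `query_start_loc` layout described above (number of requests + 1 entries) can be sketched with plain lists. The helper names are hypothetical; the only assumption is that requests are concatenated back to back in `input_tokens`.

```python
from itertools import accumulate
from typing import List


def build_query_start_loc(request_lengths: List[int]) -> List[int]:
    """Start offset of each request, plus a final entry for the total length."""
    return [0] + list(accumulate(request_lengths))


def split_batch(input_tokens: List[int], query_start_loc: List[int]) -> List[List[int]]:
    """Slice the flat token list back into per-request token lists."""
    return [
        input_tokens[start:end]
        for start, end in zip(query_start_loc[:-1], query_start_loc[1:])
    ]
```

Three requests of lengths 3, 2, and 4 give `query_start_loc = [0, 3, 5, 9]`, which has four entries.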

class lmcache.blend.interfaces.BlendRetrieverResult(k: Tensor | None, v: Tensor | None, valid_mask: Tensor, original_positions: Tensor)[source]#

The result of the cacheblend retriever

Variables:
  • k (Optional[torch.Tensor]) – The K tensor of a single layer; None if nothing is retrieved

  • v (Optional[torch.Tensor]) – The V tensor of a single layer; None if nothing is retrieved

  • valid_mask (torch.Tensor) – The valid mask on CPU

  • original_positions (torch.Tensor) – The original positions of the retrieved KV in the input sequence. If the corresponding KV is not valid, the position will be 0. This tensor will be on the same device as K and V.

k: Tensor | None#
original_positions: Tensor#
v: Tensor | None#
valid_mask: Tensor#
class lmcache.blend.interfaces.BlendRetrieverTask[source]#

The KV retrieval task created by the BlendRetriever

abstract result(layer_id: int) BlendRetrieverResult[source]#

Blocking function that returns the K and V tensors of a single layer. The length of the returned K and V tensors should match the length of the input tokens passed to BlendRetriever.new_request. If the KV of a token is not available, the corresponding entry in valid_mask is 0 and the corresponding values in the K and V tensors are undefined.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult
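
A toy, in-memory task that follows the contract above may help make the `valid_mask` semantics concrete. It is a sketch under stated assumptions, not LMCache code: `ToyRetrieverResult` and `ToyRetrieverTask` are hypothetical names, lists with one scalar per token stand in for tensors, and invalid slots are filled with 0.0 although the real values are undefined.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyRetrieverResult:
    k: Optional[List[float]]
    v: Optional[List[float]]
    valid_mask: List[int]
    original_positions: List[int]


class ToyRetrieverTask:
    """Hypothetical task backed by a dict: layer_id -> {token index: (k, v, position)}."""

    def __init__(self, num_tokens: int, cache: Dict[int, Dict[int, Tuple[float, float, int]]]):
        self.num_tokens = num_tokens
        self.cache = cache

    def result(self, layer_id: int) -> ToyRetrieverResult:
        hits = self.cache.get(layer_id, {})
        n = self.num_tokens
        valid_mask = [1 if i in hits else 0 for i in range(n)]
        # Invalid tokens report position 0, as described for original_positions.
        original_positions = [hits[i][2] if i in hits else 0 for i in range(n)]
        if not hits:
            # Nothing retrieved for this layer: K and V are None.
            return ToyRetrieverResult(None, None, valid_mask, original_positions)
        # Output length always matches the number of input tokens.
        k = [hits[i][0] if i in hits else 0.0 for i in range(n)]
        v = [hits[i][1] if i in hits else 0.0 for i in range(n)]
        return ToyRetrieverResult(k, v, valid_mask, original_positions)
```

Callers should consult `valid_mask` before reading any entry of `k` or `v`.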

lmcache.blend.retriever module#

class lmcache.blend.retriever.SPTBlendRetriever(spt: Tensor, cache_engine: LMCacheEngine, metadata: LMCacheEngineMetadata)[source]#

Bases: BlendRetriever

Implement the retrieval logic using “SPecial Token” (SPT) as delimiter.

This implementation assumes that every input text chunk ends with a special token.

Example

Input = [x, x, x, spt, y, y, spt, z, z, z, z]

Requests sent to LMCache engine:
  • [x, x, x, spt]

  • [y, y, spt]

  • [z, z, z, z]

Therefore, when using this retriever, the text chunks should preferably also end with the special token.
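
The segmentation in the example can be sketched as a list-based splitter. This is a minimal illustration, not the SPTBlendRetriever implementation: `split_on_special_token` is a hypothetical name, lists stand in for tensors, and the delimiter is allowed to span several tokens because `spt` is passed as a tensor.

```python
from typing import List


def split_on_special_token(tokens: List[int], spt: List[int]) -> List[List[int]]:
    """Split tokens into segments, each ending with the delimiter `spt`.

    A trailing segment without a delimiter is kept as-is.
    """
    n = len(spt)
    if n == 0:
        raise ValueError("spt must contain at least one token")
    segments: List[List[int]] = []
    start = 0
    i = 0
    while i + n <= len(tokens):
        if tokens[i:i + n] == spt:
            # Close the current segment, delimiter included.
            segments.append(tokens[start:i + n])
            start = i + n
            i += n
        else:
            i += 1
    if start < len(tokens):
        segments.append(tokens[start:])
    return segments
```

With `x = 1`, `y = 2`, `z = 3`, and `spt = [0]`, the input from the example splits into `[1, 1, 1, 0]`, `[2, 2, 0]`, and `[3, 3, 3, 3]`.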

new_request(input_tokens: Tensor, query_start_loc: Tensor) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:
  • input_tokens (torch.Tensor) – The input tokens, could include multiple requests in a batch

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1.

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask

class lmcache.blend.retriever.SPTBlendRetrieverTask(token_segments: List[Tensor], tasks: List[Future], fmt: str)[source]#

Bases: BlendRetrieverTask

result(layer_id: int) BlendRetrieverResult[source]#

Blocking function that returns the K and V tensors of a single layer. The length of the returned K and V tensors should match the length of the input tokens passed to BlendRetriever.new_request.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult
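
The constructor arguments (`token_segments` and a list of `Future` objects) suggest one background lookup per segment whose results are joined when `result` is called. A toy sketch of that pattern using `concurrent.futures` follows. It is an assumption about the general shape, not the actual implementation: `fake_lookup`, `ToySegmentTask`, and `run_demo` are hypothetical, lists stand in for tensors, and the per-layer `layer_id` argument is omitted.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple


def fake_lookup(segment: List[int]) -> Optional[List[float]]:
    """Hypothetical cache lookup: segments starting with an even token hit."""
    if segment[0] % 2 == 0:
        return [float(t) for t in segment]
    return None


class ToySegmentTask:
    def __init__(self, token_segments: List[List[int]], tasks: List[Future]):
        self.token_segments = token_segments
        self.tasks = tasks

    def result(self) -> Tuple[List[float], List[int]]:
        k: List[float] = []
        valid_mask: List[int] = []
        for segment, future in zip(self.token_segments, self.tasks):
            hit = future.result()  # blocks until this segment's lookup finishes
            if hit is None:
                # Cache miss: pad so the output length matches the input length.
                k.extend([0.0] * len(segment))
                valid_mask.extend([0] * len(segment))
            else:
                k.extend(hit)
                valid_mask.extend([1] * len(segment))
        return k, valid_mask


def run_demo() -> Tuple[List[float], List[int]]:
    segments = [[2, 4, 0], [1, 0], [6]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fake_lookup, s) for s in segments]
        return ToySegmentTask(segments, futures).result()
```

Joining per-segment results in input order keeps the output aligned with the tokens passed to `new_request`, regardless of the order in which the lookups complete.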

Module contents#