LMCache Blend#

Submodules#

lmcache.blend.executor module#

class lmcache.blend.executor.CacheBlendImpl(recompute_ratio: float)[source]#

Bases: BlendExecutor

blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

This function blends the retrieved KV with the fresh KV, and returns the short Q, the blended (long) KV, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K layer, in shape [num_tokens, hidden_dims]

  • retrieved_v (torch.Tensor) – The retrieved V layer, in shape [num_tokens, hidden_dims]

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split, in shape [num_tokens, hidden_dims]

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split, in shape [num_tokens, hidden_dims]

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1. Note this will NOT be changed after token selection.

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions
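
The sketch below shows how the arguments above fit together for a single layer with dummy tensors. The shapes, the positional-encoder callback signature (positions, q, k) -> (q, k), the mask dtype, and token_dim=0 are assumptions for illustration, not guarantees of the API:

    # Hedged sketch only: shapes, the encoder callback signature, and token_dim
    # below are assumptions for illustration, not guarantees of the LMCache API.
    import torch

    from lmcache.blend.executor import CacheBlendImpl

    num_tokens, hidden_dims = 8, 128
    executor = CacheBlendImpl(recompute_ratio=0.15)

    # Assumed callback signature: (positions, q, k) -> (q, k). A real setup
    # would register the model's rotary-embedding function here instead.
    executor.set_positional_encoder(lambda positions, q, k: (q, k))
    executor.set_reverse_positional_encoder(lambda positions, q, k: (q, k))

    output = executor.blend(
        layer_id=0,
        retrieved_k=torch.randn(num_tokens, hidden_dims),
        retrieved_v=torch.randn(num_tokens, hidden_dims),
        valid_mask=torch.ones(num_tokens, dtype=torch.long),  # CPU mask, 1 = valid
        original_positions=torch.arange(num_tokens),
        fresh_q=torch.randn(num_tokens, hidden_dims),
        fresh_k=torch.randn(num_tokens, hidden_dims),
        fresh_v=torch.randn(num_tokens, hidden_dims),
        positions=torch.arange(num_tokens),
        query_start_loc=torch.tensor([0, num_tokens]),  # one request in the batch
        token_dim=0,  # assumption: tokens along dim 0
    )
    # output.q keeps only the selected tokens, so it may be shorter than
    # output.k / output.v, which cover all num_tokens positions.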

set_positional_encoder(positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
set_reverse_positional_encoder(reverse_positional_encoder: Callable[[Tensor, Tensor, Tensor], Tuple[Tensor, Tensor]])[source]#
lmcache.blend.executor.create_index(ndims, target_dim, index)[source]#
lmcache.blend.executor.indices_to_mask(indices, size)[source]#
lmcache.blend.executor.mask_to_indices(mask)[source]#
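
These module-level helpers are small indexing utilities. The sketch below shows what they plausibly compute, inferred only from their names and the blend signature; the actual implementations in lmcache.blend.executor may differ:

    # Conceptual equivalents only -- not the actual lmcache implementations.
    import torch

    def mask_to_indices_sketch(mask: torch.Tensor) -> torch.Tensor:
        # Positions where the mask is nonzero, as a 1-D index tensor.
        return torch.nonzero(mask, as_tuple=False).flatten()

    def indices_to_mask_sketch(indices: torch.Tensor, size: int) -> torch.Tensor:
        # Boolean mask of length `size` that is True at the given indices.
        mask = torch.zeros(size, dtype=torch.bool, device=indices.device)
        mask[indices] = True
        return mask

    def create_index_sketch(ndims: int, target_dim: int, index) -> tuple:
        # Indexing tuple that selects `index` along `target_dim` and keeps all
        # other dimensions, e.g. tensor[create_index_sketch(3, 1, ids)].
        return tuple(index if d == target_dim else slice(None) for d in range(ndims))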

lmcache.blend.interfaces module#

class lmcache.blend.interfaces.BlendExecutor[source]#

The interface for the cacheblend executor to blend the retrieved KV with fresh KVs

abstract blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) BlendOutput[source]#

This function blends the retrieved KV with the fresh KV, and returns the short Q, the blended (long) KV, and the positions of the tokens in Q.

Parameters:
  • layer_id (int) – The layer id

  • retrieved_k (torch.Tensor) – The retrieved K tensor

  • retrieved_v (torch.Tensor) – The retrieved V tensor

  • valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.

  • original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV

  • fresh_q (torch.Tensor) – The fresh Q tensor from QKV split

  • fresh_k (torch.Tensor) – The fresh K tensor from QKV split

  • fresh_v (torch.Tensor) – The fresh V tensor from QKV split

  • positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q

  • query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1

  • token_dim (int) – The token dimension

Returns:

The blended Q, K, V, and positions
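
Concrete executors subclass BlendExecutor and implement blend(). The skeleton below is only a sketch of what such a subclass looks like; its pass-through body selects every token and performs no real CacheBlend logic:

    # Skeleton sketch of a custom executor; the pass-through body is for
    # illustration only and does not implement real CacheBlend selection.
    import torch

    from lmcache.blend.interfaces import BlendExecutor, BlendOutput

    class PassThroughBlendExecutor(BlendExecutor):
        def blend(self, layer_id, retrieved_k, retrieved_v, valid_mask,
                  original_positions, fresh_q, fresh_k, fresh_v, positions,
                  query_start_loc, token_dim) -> BlendOutput:
            # Keep the fresh Q/K/V unchanged and select every token,
            # so nothing is skipped and nothing is blended.
            num_tokens = fresh_q.shape[token_dim]
            return BlendOutput(
                q=fresh_q,
                k=fresh_k,
                v=fresh_v,
                positions=positions,
                local_indices=torch.arange(num_tokens, device=fresh_q.device),
                query_start_loc=None,  # None: no token selection happened
            )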

class lmcache.blend.interfaces.BlendOutput(q: Tensor, k: Tensor, v: Tensor, positions: Tensor, local_indices: Tensor, query_start_loc: Tensor | None)[source]#

The output of the cacheblend module

Variables:
  • q (torch.Tensor) – The short Q tensor with selected tokens

  • k (torch.Tensor) – The long K tensor with the updated values

  • v (torch.Tensor) – The long V tensor with the updated values

  • positions (torch.Tensor) – The positions of the selected Q tokens in the input sequence

  • local_indices (torch.Tensor) – The positions of the selected Q tokens in fresh q

  • query_start_loc (Optional[torch.Tensor]) – The modified query_start_loc if token selection has happened. Will be None if no selection has happened.

k: Tensor#
local_indices: Tensor#
positions: Tensor#
q: Tensor#
query_start_loc: Tensor | None#
v: Tensor#
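
The fields relate as follows: q is the short tensor that still needs attention, while k and v already cover the full sequence. A minimal illustration of those relationships, ignoring multi-head layout and any real attention backend:

    # Illustration of the BlendOutput field relationships; not a real kernel.
    import torch

    def attend_blended(out) -> torch.Tensor:
        # One selected query token per entry in local_indices, so q can be
        # shorter than the blended k / v.
        assert out.q.shape[0] == out.local_indices.shape[0] <= out.k.shape[0]
        # out.positions say where each selected token sits in the input sequence.
        scores = out.q @ out.k.transpose(0, 1) / out.q.shape[-1] ** 0.5
        return scores.softmax(dim=-1) @ out.v  # shape: [num_selected, hidden_dims]
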
class lmcache.blend.interfaces.BlendRetriever[source]#

The interface for the cacheblend retriever to retrieve the KV caches

It takes the input tokens and ROI as input, launches the retrieval tasks (possibly asynchronous), and returns a BlendRetrieverTask that can be used to retrieve the KV caches.

abstract new_request(full_prompts: List[Tensor], indices: List[List[int]]) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:

  • full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch.

  • indices (List[List[int]]) – The indices of where the segmented requests start in the full prompts.

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask

class lmcache.blend.interfaces.BlendRetrieverResult(k: Tensor | None, v: Tensor | None, valid_mask: Tensor, original_positions: Tensor)[source]#

The result of the cacheblend retriever

Variables:
  • k (torch.Tensor) – The K tensor of a single layer, will be None if nothing is retrieved

  • v (torch.Tensor) – The V tensor of a single layer, will be None if nothing is retrieved

  • valid_mask (torch.Tensor) – The valid mask on CPU

  • original_positions (torch.Tensor) – The original positions of the retrieved KV in the input sequence. If the corresponding KV is not valid, the position will be 0. This tensor will be on the same device as K and V.

k: Tensor | None#
original_positions: Tensor#
v: Tensor | None#
valid_mask: Tensor#
class lmcache.blend.interfaces.BlendRetrieverTask[source]#

The KV retrieval task created by the BlendRetriever

abstract result(layer_id: int) BlendRetrieverResult[source]#

Blocking function to get a single layer of K and V tensors. The returned K and V tensors should match the length of the input tokens passed to the BlendRetriever.new_request function. If the KV of a token is not available, the valid_mask will be 0 and the corresponding values in the KV tensors are undefined.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult
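
A consumer typically drains the task one layer at a time, checking the mask before trusting the retrieved values. The loop below is a sketch; task stands for any BlendRetrieverTask and num_layers is a hypothetical layer count:

    # Sketch of draining a BlendRetrieverTask layer by layer.
    # `task` and `num_layers` are placeholders for illustration.
    def drain_retriever_task(task, num_layers: int):
        for layer_id in range(num_layers):
            result = task.result(layer_id)  # blocking per-layer call
            if result.k is None or result.v is None:
                # Nothing retrieved for this layer: every token must be
                # recomputed from the fresh QKV.
                continue
            # valid_mask lives on CPU; entries equal to 0 mark tokens whose
            # retrieved KV is undefined and must not be used. original_positions
            # is on the same device as k / v and is 0 for invalid tokens.
            yield layer_id, result.k, result.v, result.valid_mask, result.original_positions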

lmcache.blend.retriever module#

class lmcache.blend.retriever.SPTBlendRetriever(spt: List[int], cache_engine: LMCacheEngine, metadata: LMCacheEngineMetadata)[source]#

Bases: BlendRetriever

Implement the retrieval logic using a “SPecial Token” (SPT) as the delimiter.

This implementation assumes that there MUST be a special token at the end of the input text chunk.

Example

Input = [x, x, x, spt, y, y, spt, z, z, z, z]

Requests sent to LMCache engine when using drop_spt_and_get_indices and new_request:

  • [x, x, x]

  • [y, y]

  • [z, z, z, z]

Therefore, to use this retriever, the text chunks should also end with the special token.

drop_spt_and_get_indices(full_prompt: List[int]) Tuple[List[int], List[int]][source]#

Drop the special token and get the indices of the split requests.

Parameters:

full_prompt (List[int]) – The full prompt after tokenization.

Returns:

The new prompt without the special tokens and the indices of the split segments. The indices record the start of each segment, ending with the end of the full prompt, e.g. [0, index_of_segment2, len(full_prompt)]

new_request(full_prompts: List[Tensor], indices: List[List[int]]) BlendRetrieverTask[source]#

Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.

Parameters:

  • full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch, which will contain the tokens hitting vLLM’s internal prefix caching.

  • indices (List[List[int]]) – The indices of where the segmented requests start in the full prompts.

Returns:

The retriever task to retrieve the KV caches

Return type:

BlendRetrieverTask
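
Putting the pieces together, a caller strips the special tokens from the tokenized prompt, then issues new_request and reads out the layers. The sketch below assumes new_request receives the prompt returned by drop_spt_and_get_indices; the special-token id and the already-constructed cache_engine / metadata objects are placeholders:

    # Sketch only: the special-token id is hypothetical, and cache_engine /
    # metadata are assumed to be an already-constructed LMCacheEngine and its
    # LMCacheEngineMetadata (setup not shown).
    import torch

    from lmcache.blend.retriever import SPTBlendRetriever

    def retrieve_first_layer(cache_engine, metadata):
        retriever = SPTBlendRetriever(spt=[32001], cache_engine=cache_engine,
                                      metadata=metadata)

        # Tokenized prompt laid out as [x, x, x, spt, y, y, spt, z, z, z, z].
        full_prompt = [11, 12, 13, 32001, 21, 22, 32001, 31, 32, 33, 34]
        clean_prompt, indices = retriever.drop_spt_and_get_indices(full_prompt)

        # One entry per request in the batch; assumption: new_request takes the
        # prompt with the special tokens already dropped.
        task = retriever.new_request(full_prompts=[torch.tensor(clean_prompt)],
                                     indices=[indices])
        return task.result(layer_id=0)  # blocking; one layer of K/V plus masks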

class lmcache.blend.retriever.SPTBlendRetrieverTask(token_segments: List[Tensor], tasks: List[Future], fmt: str)[source]#

Bases: BlendRetrieverTask

result(layer_id: int) BlendRetrieverResult[source]#

Blocking function to get a single layer of K and V tensors. The returned K and V tensors should match the length of the input tokens passed to the BlendRetriever.new_request function.

Parameters:

layer_id (int) – the layer id

Returns:

The BlendRetrieverResult object

Return type:

BlendRetrieverResult

Module contents#