LMCache Blend
Submodules
lmcache.blend.executor module
- class lmcache.blend.executor.CacheBlendImpl(recompute_ratio: float)
Bases: BlendExecutor
- blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) → BlendOutput
Blends the retrieved KV with the fresh KV and returns the short Q (selected tokens only), the long blended K and V, and the positions of the tokens in Q.
- Parameters:
layer_id (int) – The layer id
retrieved_k (torch.Tensor) – The retrieved K layer, in shape [num_tokens, hidden_dims]
retrieved_v (torch.Tensor) – The retrieved V layer, in shape [num_tokens, hidden_dims]
valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.
original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV
fresh_q (torch.Tensor) – The fresh Q tensor from QKV split, in shape [num_tokens, hidden_dims]
fresh_k (torch.Tensor) – The fresh K tensor from QKV split, in shape [num_tokens, hidden_dims]
fresh_v (torch.Tensor) – The fresh V tensor from QKV split, in shape [num_tokens, hidden_dims]
positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q
query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1. Note this will NOT be changed after token selection.
token_dim (int) – The token dimension
- Returns:
The blended Q, K, V, and positions
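The query_start_loc layout described above (one entry per request plus a final end offset) can be illustrated with a minimal sketch. The helper name build_query_start_loc is hypothetical and not part of LMCache, and plain Python lists stand in for torch tensors.

```python
from itertools import accumulate
from typing import List


def build_query_start_loc(request_lengths: List[int]) -> List[int]:
    """Hypothetical helper: cumulative start offsets, length = num_requests + 1."""
    return [0, *accumulate(request_lengths)]


# Three requests with 3, 2 and 4 tokens packed into one batch.
query_start_loc = build_query_start_loc([3, 2, 4])
print(query_start_loc)  # [0, 3, 5, 9]
```

Request i then covers the token range query_start_loc[i] to query_start_loc[i + 1] (exclusive) in the flattened batch.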
lmcache.blend.interfaces module
- class lmcache.blend.interfaces.BlendExecutor
The interface for the CacheBlend executor, which blends the retrieved KV with the fresh KV.
- abstract blend(layer_id: int, retrieved_k: Tensor, retrieved_v: Tensor, valid_mask: Tensor, original_positions: Tensor, fresh_q: Tensor, fresh_k: Tensor, fresh_v: Tensor, positions: Tensor, query_start_loc: Tensor, token_dim: int) → BlendOutput
Blends the retrieved KV with the fresh KV and returns the short Q (selected tokens only), the long blended K and V, and the positions of the tokens in Q.
- Parameters:
layer_id (int) – The layer id
retrieved_k (torch.Tensor) – The retrieved K tensor
retrieved_v (torch.Tensor) – The retrieved V tensor
valid_mask (torch.Tensor) – A CPU tensor returned from the retriever indicating whether the KV is valid.
original_positions (torch.Tensor) – The original positions of the tokens in the retrieved KV
fresh_q (torch.Tensor) – The fresh Q tensor from QKV split
fresh_k (torch.Tensor) – The fresh K tensor from QKV split
fresh_v (torch.Tensor) – The fresh V tensor from QKV split
positions (torch.Tensor) – The positions in the input of the tokens in the fresh_q
query_start_loc (torch.Tensor) – The start location of the query if input_tokens has multiple requests in a batch. The length should be the number of requests in the batch + 1
token_dim (int) – The token dimension
- Returns:
The blended Q, K, V, and positions
- class lmcache.blend.interfaces.BlendOutput(q: Tensor, k: Tensor, v: Tensor, positions: Tensor, local_indices: Tensor, query_start_loc: Tensor | None)
The output of the CacheBlend module
- Variables:
q (torch.Tensor) – The short Q tensor with selected tokens
k (torch.Tensor) – The long K tensor with the updated values
v (torch.Tensor) – The long V tensor with the updated values
positions (torch.Tensor) – The positions of the selected Q tokens in the input sequence
local_indices (torch.Tensor) – The positions of the selected Q tokens in fresh q
query_start_loc (Optional[torch.Tensor]) – The modified query_start_loc if token selection has happened. Will be None if no selection has happened.
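One way the modified query_start_loc could relate to local_indices is sketched below: each request boundary is remapped to the number of selected tokens that precede it. This is an illustrative assumption about the semantics, not LMCache's implementation. The name remap_query_start_loc is hypothetical, lists stand in for tensors, and local_indices is assumed to be sorted in ascending order.

```python
from bisect import bisect_left
from typing import List


def remap_query_start_loc(
    query_start_loc: List[int], local_indices: List[int]
) -> List[int]:
    """Hypothetical helper: request boundaries after token selection.

    Assumes local_indices is sorted ascending and indexes into fresh q.
    """
    # For each original boundary, count the selected tokens that come before it.
    return [bisect_left(local_indices, boundary) for boundary in query_start_loc]


# Batch of three requests (3, 2 and 4 tokens); five tokens were selected.
new_loc = remap_query_start_loc([0, 3, 5, 9], [1, 2, 4, 7, 8])
print(new_loc)  # [0, 2, 3, 5]
```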
- class lmcache.blend.interfaces.BlendRetriever
The interface for the CacheBlend retriever, which retrieves the KV caches.
It takes the input tokens and ROI as input, launches retrieval tasks (possibly asynchronous), and returns a BlendRetrieverTask for retrieving the KV caches.
- abstract new_request(full_prompts: List[Tensor], indices: List[List[int]]) → BlendRetrieverTask
Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.
- Parameters:
full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch.
indices (List[List[int]]) – The indices where the segmented requests start in the full prompts.
- Returns:
The retriever task to retrieve the KV caches
- Return type:
BlendRetrieverTask
- class lmcache.blend.interfaces.BlendRetrieverResult(k: Tensor | None, v: Tensor | None, valid_mask: Tensor, original_positions: Tensor)
The result of the CacheBlend retriever
- Variables:
k (torch.Tensor) – The K tensor of a single layer, will be None if nothing is retrieved
v (torch.Tensor) – The V tensor of a single layer, will be None if nothing is retrieved
valid_mask (torch.Tensor) – The valid mask on CPU
original_positions (torch.Tensor) – The original positions of the retrieved KV in the input sequence. If the corresponding KV is not valid, the position will be 0. This tensor will be on the same device as K and V.
- class lmcache.blend.interfaces.BlendRetrieverTask
The KV retrieval task created by the BlendRetriever
- abstract result(layer_id: int) → BlendRetrieverResult
Blocking function that returns the K and V tensors of a single layer. The returned K and V tensors should match the length of the input tokens passed to BlendRetriever.new_request. If the KV of a token is not available, its entry in valid_mask will be 0 and the corresponding values in the K and V tensors will be undefined.
- Parameters:
layer_id (int) – The layer id
- Returns:
The BlendRetrieverResult object
- Return type:
BlendRetrieverResult
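Because K and V entries are undefined wherever valid_mask is 0, a caller has to consult the mask before reading them. The sketch below shows that contract with plain lists in place of tensors. The helper indices_missing_kv is hypothetical and not part of LMCache.

```python
from typing import List


def indices_missing_kv(valid_mask: List[int]) -> List[int]:
    """Hypothetical helper: token indices whose retrieved KV must not be read."""
    return [i for i, ok in enumerate(valid_mask) if not ok]


# Tokens 2 and 3 were not found in the cache, so their K/V rows are undefined.
valid_mask = [1, 1, 0, 0, 1]
print(indices_missing_kv(valid_mask))  # [2, 3]
```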
lmcache.blend.retriever module
- class lmcache.blend.retriever.SPTBlendRetriever(spt: List[int], cache_engine: LMCacheEngine, metadata: LMCacheEngineMetadata)
Bases: BlendRetriever
Implement the retrieval logic using “SPecial Token” (SPT) as delimiter.
This implementation requires a special token at the end of each input text chunk.
Example
Input = [x, x, x, spt, y, y, spt, z, z, z, z]
Requests sent to the LMCache engine when using drop_spt_and_get_indices and new_request:
- [x, x, x]
- [y, y]
- [z, z, z, z]
Therefore, to use this retriever, the text chunks should also end with the special token.
- drop_spt_and_get_indices(full_prompt: List[int]) → Tuple[List[int], List[int]]
Drop the special token and get the indices of the split requests.
- Parameters:
full_prompt (List[int]) – The full prompt after tokenization.
- Returns:
The new prompt without the special tokens, and the indices of the split segments. The indices record the start of each segment and end with the end of the full prompt, e.g. [0, index_of_segment2, len(full_prompt)]
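The splitting behavior described above can be sketched in plain Python. This is an illustrative re-implementation, not LMCache's code: the function name drop_spt_and_get_indices_sketch is hypothetical, the special token is passed in explicitly as a token-id list, and the returned indices are assumed to refer to positions in the prompt after the special tokens are dropped.

```python
from typing import List, Tuple


def drop_spt_and_get_indices_sketch(
    full_prompt: List[int], spt: List[int]
) -> Tuple[List[int], List[int]]:
    """Illustrative sketch: strip the delimiter and record segment starts."""
    if not spt:
        raise ValueError("spt must contain at least one token")
    new_prompt: List[int] = []
    indices: List[int] = [0]
    i = 0
    while i < len(full_prompt):
        if full_prompt[i : i + len(spt)] == spt:
            # A delimiter closes the current segment; record where the next starts.
            if len(new_prompt) != indices[-1]:
                indices.append(len(new_prompt))
            i += len(spt)
        else:
            new_prompt.append(full_prompt[i])
            i += 1
    # Close the final segment with the length of the stripped prompt.
    if indices[-1] != len(new_prompt):
        indices.append(len(new_prompt))
    return new_prompt, indices


# x=1, y=2, z=3, spt=[99], mirroring the example input above.
prompt, indices = drop_spt_and_get_indices_sketch(
    [1, 1, 1, 99, 2, 2, 99, 3, 3, 3, 3], spt=[99]
)
print(prompt)   # [1, 1, 1, 2, 2, 3, 3, 3, 3]
print(indices)  # [0, 3, 5, 9]
```

Consecutive entries of indices delimit the segments [x, x, x], [y, y] and [z, z, z, z] that would be looked up separately.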
- new_request(full_prompts: List[Tensor], indices: List[List[int]]) → BlendRetrieverTask
Create a new BlendRetrieverTask to retrieve the KV caches. It may launch async tasks in the background during the retrieval.
- Parameters:
full_prompts (List[torch.Tensor]) – The full prompts for each request in this batch, including the tokens that hit vLLM’s internal prefix cache.
indices (List[List[int]]) – The indices where the segmented requests start in the full prompts.
- Returns:
The retriever task to retrieve the KV caches
- Return type:
BlendRetrieverTask
- class lmcache.blend.retriever.SPTBlendRetrieverTask(token_segments: List[Tensor], tasks: List[Future], fmt: str)
Bases: BlendRetrieverTask
- result(layer_id: int) → BlendRetrieverResult
Blocking function that returns the K and V tensors of a single layer. The returned K and V tensors should match the length of the input tokens passed to BlendRetriever.new_request.
- Parameters:
layer_id (int) – the layer id
- Returns:
The BlendRetrieverResult for the requested layer, holding the K and V tensors
- Return type:
BlendRetrieverResult
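The blocking behavior of a task backed by a list of futures can be sketched as follows. This is a simplified stand-in under stated assumptions, not the SPTBlendRetrieverTask implementation: the class FutureBackedTask, the lookup function, and the per-segment payload format are all hypothetical, and Python lists replace tensors.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

# Assumed payload: {layer_id: (k_rows, v_rows)} on a hit, None on a cache miss.
SegmentKV = Optional[Dict[int, Tuple[List[float], List[float]]]]


class FutureBackedTask:
    """Hypothetical stand-in for a retriever task; lists replace tensors."""

    def __init__(self, segment_lengths: List[int], tasks: List[Future]) -> None:
        self.segment_lengths = segment_lengths
        self.tasks = tasks

    def result(self, layer_id: int) -> Tuple[List[float], List[float], List[int]]:
        k: List[float] = []
        v: List[float] = []
        valid_mask: List[int] = []
        for length, task in zip(self.segment_lengths, self.tasks):
            kv = task.result()  # blocks until this segment's lookup finishes
            if kv is None:
                # Miss: pad with placeholders and mark the tokens as invalid.
                k += [0.0] * length
                v += [0.0] * length
                valid_mask += [0] * length
            else:
                seg_k, seg_v = kv[layer_id]
                k += seg_k
                v += seg_v
                valid_mask += [1] * length
        return k, v, valid_mask


def lookup(hit: bool, length: int) -> SegmentKV:
    """Hypothetical cache lookup for one segment with a single layer (id 0)."""
    if not hit:
        return None
    return {0: ([1.0] * length, [2.0] * length)}


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(lookup, True, 2), pool.submit(lookup, False, 3)]
    k, v, mask = FutureBackedTask([2, 3], futures).result(layer_id=0)

print(mask)  # [1, 1, 0, 0, 0]
```

The output length equals the total number of input tokens across segments, which matches the requirement that the returned K and V line up with the tokens passed to new_request.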