Blending#
CacheBlend enables KV cache reuse at non-prefix positions by recomputing only a small subset of tokens. For example, CacheBlend can combine multiple precomputed KV caches when their corresponding texts are concatenated in the LLM input.
Configuring CacheBlend in RAG scenarios#
Here, we will explain the code in our end-to-end example.
Below are some blending-related configurations (and explanations):
import os

# Enable blending in LMCache
os.environ["LMCACHE_ENABLE_BLENDING"] = "True"
# Separator string inserted between different chunks
os.environ["LMCACHE_BLEND_SPECIAL_STR"] = " # # "
# Layerwise mode must be turned on when blending is enabled
os.environ["LMCACHE_USE_LAYERWISE"] = "True"
# Check layer 1 to determine which tokens to recompute
os.environ["LMCACHE_BLEND_CHECK_LAYERS"] = "1"
# Ratio of tokens to recompute
os.environ["LMCACHE_BLEND_RECOMPUTE_RATIOS"] = "0.15"
# Optionally, use sparse attention to improve generation quality
# with a more accurate attention mask
if enable_sparse:
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
    os.environ["LMCACHE_EXTRA_CONFIG"] = '{"enable_sparse": true}'
First, we preprocess the texts into tokens, because tokenizing a concatenated string may produce different tokens than concatenating the results of tokenizing each string individually. For example, assume we have a system prompt and three text chunks. We need to preprocess them into tokens before sending them to the LLM:
sys_prompt = tokenizer.encode("You are a very helpful assistant.")
# [1:] drops the leading BOS token added by the tokenizer so that
# the token lists can be concatenated cleanly
chunk1_prompt = tokenizer.encode("Hello, how are you?" * 500)[1:]
chunk2_prompt = tokenizer.encode("Hello, what's up?" * 500)[1:]
chunk3_prompt = tokenizer.encode("Hi, what are you up to?" * 500)[1:]
blend_special_str = tokenizer.encode(os.getenv("LMCACHE_BLEND_SPECIAL_STR"))[1:]
first_prompt = (
    sys_prompt
    + blend_special_str
    + chunk1_prompt
    + blend_special_str
    + chunk2_prompt
    + blend_special_str
    + chunk3_prompt
    + blend_special_str
    + tokenizer.encode("Hello, my name is")[1:]
)
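To see why we concatenate token lists instead of strings, here is a quick illustrative check; the strings are arbitrary, and whether the two results differ depends on the tokenizer.
text_a = "Hello, how are you?"
text_b = " # # "
tokens_joined = tokenizer.encode(text_a + text_b)
tokens_separate = tokenizer.encode(text_a) + tokenizer.encode(text_b)[1:]
# The two token sequences are not guaranteed to be identical,
# which is why prompts are built by concatenating token lists.
print(tokens_joined == tokens_separate)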
Then, we can send the tokenized prompt to vLLM. Meanwhile, LMCache will store the KV cache of each chunk, using LMCACHE_BLEND_SPECIAL_STR as the delimiter between chunks.
llm.generate(prompts={"prompt_token_ids": first_prompt})
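In practice you will usually also pass sampling parameters and read back the generated text; a minimal sketch with illustrative values:
from vllm import SamplingParams

# Illustrative sampling settings; the end-to-end example may use different values.
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(
    prompts={"prompt_token_ids": first_prompt},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)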
Similarly, we build another prompt using the same chunks but in a different order.
second_prompt = (
    sys_prompt
    + blend_special_str
    + chunk2_prompt
    + blend_special_str
    + chunk1_prompt
    + blend_special_str
    + chunk3_prompt
    + blend_special_str
    + tokenizer.encode("Hello, how are you?")[1:]
)
llm.generate(prompts={"prompt_token_ids": second_prompt})
Even though the chunks appear in a different order in the second prompt, LMCache can still reuse the KV caches of chunk1, chunk2, and chunk3, recomputing only the small fraction of tokens controlled by LMCACHE_BLEND_RECOMPUTE_RATIOS.
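In a real RAG pipeline, the retriever decides which chunks appear and in what order, so it is convenient to wrap the prompt assembly in a small helper. The function and question text below are hypothetical illustrations, not part of LMCache.
def build_blended_prompt(sys_tokens, chunk_token_lists, question_tokens, sep_tokens):
    # Joins the system prompt, the retrieved chunks (in any order), and the
    # question, inserting the blend separator tokens between every segment.
    prompt = list(sys_tokens)
    for chunk in chunk_token_lists:
        prompt += sep_tokens + chunk
    return prompt + sep_tokens + question_tokens

# Hypothetical usage: any subset of chunks, in any order.
third_prompt = build_blended_prompt(
    sys_prompt,
    [chunk3_prompt, chunk1_prompt],
    tokenizer.encode("Hello, my name is")[1:],
    blend_special_str,
)
llm.generate(prompts={"prompt_token_ids": third_prompt})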