LMCache Experimental#
Subpackages#
- lmcache.experimental.storage_backend package
- Subpackages
- Submodules
- lmcache.experimental.storage_backend.abstract_backend module
- lmcache.experimental.storage_backend.local_disk_backend module
- LocalDiskBackend
- LocalDiskBackend.async_load_bytes_from_disk()
- LocalDiskBackend.async_save_bytes_to_disk()
- LocalDiskBackend.close()
- LocalDiskBackend.contains()
- LocalDiskBackend.exists_in_put_tasks()
- LocalDiskBackend.get_blocking()
- LocalDiskBackend.insert_key()
- LocalDiskBackend.load_bytes_from_disk()
- LocalDiskBackend.load_disk()
- LocalDiskBackend.remove()
- LocalDiskBackend.submit_prefetch_task()
- LocalDiskBackend.submit_put_task()
- lmcache.experimental.storage_backend.storage_manager module
- Module contents
Submodules#
lmcache.experimental.cache_engine module#
- class lmcache.experimental.cache_engine.LMCacheEngine(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, memory_allocator: MemoryAllocatorInterface, token_database: TokenDatabase, gpu_connector: GPUConnectorInterface)[source]#
The main class for the cache engine.
When storing KV caches into the cache engine, it takes the GPU KV caches from the serving engine and converts them into MemoryObjs that reside in CPU memory. The MemoryObjs are then stored into the StorageBackends asynchronously.
When retrieving KV caches from the cache engine, it fetches the MemoryObjs from the StorageBackends and converts them back into GPU KV caches using GPUConnectors specialized for the serving engine.
It also supports prefetching KV caches from the StorageBackends; it relies on the StorageBackends to manage prefetching and retrieval requests and to avoid conflicts between them. A usage sketch follows the method listing below.
- lookup(tokens: Tensor, search_range: List[str] | None = None) int [source]#
Checks the existence of the KV caches for the given tokens in the cache engine.
- Parameters:
tokens – the input tokens, with shape [seq_len]
search_range (Optional[List[str]]) – The range of storage backends
to search in. Should be a subset of [“Hot”, “LocalDiskBackend”] for now. If None, search in all backends.
- Returns:
An int indicating how many prefix tokens are cached.
- prefetch(tokens: Tensor, mask: Tensor | None = None) None [source]#
Launches the prefetching process in the storage manager to load the KV caches into local CPU memory.
- retrieve(tokens: Tensor, mask: Tensor | None = None, **kwargs) Tensor [source]#
Retrieves the KV caches from the cache engine and puts them into the serving engine via the GPU connector.
- Parameters:
tokens (torch.Tensor) – The tokens of the corresponding KV caches.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
**kwargs –
Additional arguments for the storage backend that will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).
- Returns:
A boolean mask (on CPU) indicating which tokens were retrieved; its length is the same as that of tokens.
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
- store(tokens: Tensor, mask: Tensor | None = None, **kwargs) None [source]#
Stores the KV caches for the given tokens into the cache engine.
- Parameters:
tokens (torch.Tensor) – The tokens of the corresponding KV caches.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
**kwargs –
Additional arguments for the storage backend that will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
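The methods above compose into a simple store-lookup-retrieve flow. The following is a minimal, illustrative sketch only: it assumes an already-built LMCacheEngine (see LMCacheEngineBuilder below) and leaves the connector-specific kwargs (here kvcaches) as a placeholder, since their exact form depends on the GPUConnector in use.

```python
import torch
from typing import Any


def store_and_retrieve(engine, kvcaches: Any, tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative only: `engine` is an already-built LMCacheEngine and
    `kvcaches` is whatever the engine's GPUConnector expects (e.g. a nested
    tuple of K/V tensors, or a paged KV buffer plus slot mapping)."""
    seq_len = tokens.shape[0]

    # Mask convention from the docs: all Falses form a prefix (FFFF...TTTT),
    # and the number of Falses must be a multiple of the chunk size.
    mask = torch.ones(seq_len, dtype=torch.bool)

    # Store the GPU KV caches of the masked tokens into the cache engine.
    engine.store(tokens, mask=mask, kvcaches=kvcaches)

    # How many prefix tokens are already cached (across all backends)?
    num_cached = engine.lookup(tokens)
    print(f"{num_cached} prefix tokens are cached")

    # Retrieve them back into the serving engine; the returned CPU boolean
    # mask tells which tokens were actually loaded.
    ret_mask = engine.retrieve(tokens, mask=mask, kvcaches=kvcaches)
    return ret_mask
```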
- class lmcache.experimental.cache_engine.LMCacheEngineBuilder[source]#
- classmethod destroy(instance_id: str) None [source]#
Closes and deletes the LMCacheEngine instance associated with the given instance ID.
- classmethod get(instance_id: str) LMCacheEngine | None [source]#
Returns the LMCacheEngine instance associated with the instance ID, or None if not found.
- classmethod get_or_create(instance_id: str, config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, gpu_connector: GPUConnectorInterface) LMCacheEngine [source]#
Builds a new LMCacheEngine instance if it doesn’t already exist for the given ID.
- Raises:
ValueError – If the instance already exists with a different configuration.
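A sketch of the builder lifecycle, under the assumption that metadata (an LMCacheEngineMetadata) and gpu_connector (a GPUConnectorInterface) have already been constructed to match the serving engine; the instance ID and config values are illustrative.

```python
from lmcache.experimental.cache_engine import LMCacheEngine, LMCacheEngineBuilder
from lmcache.experimental.config import LMCacheEngineConfig


def build_engine(metadata, gpu_connector) -> LMCacheEngine:
    """`metadata` and `gpu_connector` are placeholders supplied by the caller."""
    config = LMCacheEngineConfig.from_defaults(chunk_size=256, local_cpu=True)

    # Creates the engine on first call; later calls with the same ID return
    # the existing instance (a different config raises ValueError).
    engine = LMCacheEngineBuilder.get_or_create(
        "engine-0", config, metadata, gpu_connector
    )

    # Lookup by ID returns the same instance, or None if it was never built.
    assert LMCacheEngineBuilder.get("engine-0") is engine
    return engine


# On shutdown: closes the engine and removes it from the registry.
# LMCacheEngineBuilder.destroy("engine-0")
```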
lmcache.experimental.config module#
- class lmcache.experimental.config.LMCacheEngineConfig(chunk_size: int, local_cpu: bool, max_local_cpu_size: float, local_disk: str | None, max_local_disk_size: float, remote_url: str | None, remote_serde: str | None, save_decode_cache: bool, enable_blending: bool, blend_recompute_ratio: float, blend_min_tokens: int, enable_p2p: bool, lookup_url: str | None, distributed_url: str | None, error_handling: bool)[source]#
- static from_defaults(chunk_size: int = 256, local_cpu: bool = True, max_local_cpu_size: float = 5.0, local_disk: str | None = None, max_local_disk_size: int = 0, remote_url: str | None = 'lm://localhost:65432', remote_serde: str | None = 'naive', save_decode_cache: bool = False, enable_blending: bool = False, blend_recompute_ratio: float = 0.15, blend_min_tokens: int = 256, enable_p2p: bool = False, lookup_url: str | None = None, distributed_url: str | None = None, error_handling: bool = False) LMCacheEngineConfig [source]#
- static from_env() LMCacheEngineConfig [source]#
Loads the config from environment variables. It first creates a config via from_defaults and then overwrites the configuration values with those found in the environment. The environment variables should start with LMCACHE and be in uppercase, e.g., LMCACHE_CHUNK_SIZE.
Note: the default configuration only uses the CPU.
- static from_file(file_path: str) LMCacheEngineConfig [source]#
Loads the config from a YAML file.
- static from_legacy(chunk_size: int = 256, backend: str = 'cpu', remote_url: str | None = 'lm://localhost:65432', remote_serde: str = 'naive', save_decode_cache: bool = False, enable_blending: bool = False, blend_recompute_ratio: float = 0.15, blend_min_tokens: int = 256, max_local_disk_size: float = 0.0, enable_p2p: bool = False, lookup_url: str | None = None, distributed_url: str | None = None, error_handling: bool = False) LMCacheEngineConfig [source]#
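A sketch of the three construction paths, using the field names from the signature above; the environment-variable value, file path, and size unit are illustrative assumptions.

```python
import os
from lmcache.experimental.config import LMCacheEngineConfig

# 1. From explicit defaults, overriding only selected fields.
config = LMCacheEngineConfig.from_defaults(
    chunk_size=256,
    local_cpu=True,
    max_local_cpu_size=5.0,   # size of the local CPU cache (unit assumed to be GB)
    local_disk=None,
    remote_url=None,
)

# 2. From environment variables: upper-case names prefixed with LMCACHE,
#    e.g. LMCACHE_CHUNK_SIZE; unset values fall back to the defaults.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
config = LMCacheEngineConfig.from_env()

# 3. From a YAML file (path is illustrative).
# config = LMCacheEngineConfig.from_file("lmcache_config.yaml")
```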
lmcache.experimental.gpu_connector module#
- class lmcache.experimental.gpu_connector.GPUConnectorInterface[source]#
- abstract from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Loads the data from a GPU buffer into the memory object. Subclasses should define the format of the kwargs.
- abstract get_shape(num_tokens: int) Size [source]#
Get the shape of the data given the number of tokens.
- class lmcache.experimental.gpu_connector.VLLMNestedTupleGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_tokens, …]
The token dimension is specified by token_dim when constructing the connector.
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
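The layout this connector works with can be sketched as plain tensors. The sizes and dtype below are illustrative, and the interpretation of start/end as a token range within the whole sequence is inferred from the interface.

```python
from typing import Tuple
import torch

# Illustrative shape parameters only.
num_layers, num_tokens, hidden_dim = 4, 128, 1024

KVLayer = Tuple[torch.Tensor, torch.Tensor]

# GPUTensor = Tuple[KVLayer, ...]; each layer holds a (K, V) pair whose
# first dimension is the token dimension: [num_tokens, ...].
kvcaches: Tuple[KVLayer, ...] = tuple(
    (
        torch.zeros(num_tokens, hidden_dim),  # K for this layer
        torch.zeros(num_tokens, hidden_dim),  # V for this layer
    )
    for _ in range(num_layers)
)

# The connector consumes/produces this structure through the `kvcaches` kwarg:
#   connector.to_gpu(memory_obj, start, end, kvcaches=kvcaches)
#   connector.from_gpu(memory_obj, start, end, kvcaches=kvcaches)
```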
- class lmcache.experimental.gpu_connector.VLLMPagedMemGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- class lmcache.experimental.gpu_connector.VLLMPagedMemGPUConnectorV2(hidden_dim_size: int, num_layers: int, use_gpu: bool = False, **kwargs)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Will set the memory_obj.metadata.fmt to MemoryFormat.KV_BLOB.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length is the same as that of the whole token sequence.
When prefix caching is used, slot_mapping starts with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the cached prefix (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length is the same as that of the whole token sequence.
When prefix caching is used, slot_mapping starts with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the cached prefix (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
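A sketch of the “full slot mapping” convention described in the notes above: the mapping covers the whole token sequence, and with prefix caching its leading entries are -1 up to the end of the matched prefix. The concrete slot values depend on vLLM's paged layout and are illustrative here.

```python
import torch

# Illustrative: a 12-token sequence whose first 4 tokens hit the prefix
# cache, so their slots are marked -1 and must never be written.
seq_len, prefix_hit = 12, 4

slot_mapping = torch.full((seq_len,), -1, dtype=torch.long)
slot_mapping[prefix_hit:] = torch.arange(100, 100 + seq_len - prefix_hit)

# When calling the connector, [start, end) must lie entirely past the cached
# prefix so the underlying CUDA kernel never sees a -1 slot:
#   connector.to_gpu(memory_obj, start=4, end=12,
#                    kvcaches=kvcaches, slot_mapping=slot_mapping)
```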
lmcache.experimental.memory_management module#
- class lmcache.experimental.memory_management.BufferAllocator(device='cpu')[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated pinned memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.BINARY_BUFFER) BytesBufferMemoryObj [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.BytesBufferMemoryObj(raw_bytes: bytes, metadata: MemoryObjMetadata | None = None)[source]#
Bases:
MemoryObj
Wraps a raw byte buffer with some metadata.
- get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
- class lmcache.experimental.memory_management.FreeBlock(start: int, size: int)[source]#
Metadata class used by the memory allocators
- class lmcache.experimental.memory_management.GPUMemoryAllocator(size: int, device='cuda')[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated GPU (device) memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.HostMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated Host memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.MemoryAllocatorInterface[source]#
- abstract allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.UNDEFINED) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- abstract free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- abstract get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
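Since allocate returns None on failure, callers are expected to check the result before using it. Below is a minimal sketch using one of the concrete allocators documented later in this module (PinMemoryAllocator); the pool size, shape, and dtype are illustrative assumptions.

```python
import torch
from lmcache.experimental.memory_management import MemoryFormat, PinMemoryAllocator

# Illustrative: a 256 MiB pinned-memory pool.
allocator = PinMemoryAllocator(size=1 << 28)

memory_obj = allocator.allocate(
    shape=torch.Size([2, 4, 256, 1024]),  # KV_BLOB: [2, num_layers, num_tokens, hidden_dim]
    dtype=torch.bfloat16,
    fmt=MemoryFormat.KV_BLOB,
)

if memory_obj is None:
    # Allocation can fail (e.g. the pool is exhausted); handle it explicitly.
    raise RuntimeError("pinned-memory allocation failed")

print(allocator.get_ref_count(memory_obj))

# Per the docstrings above, avoid calling allocator.free() directly; release
# the object by decreasing its reference count (ref_count_down) instead.
```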
- class lmcache.experimental.memory_management.MemoryFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- BINARY = 2#
Compressed binary array format
- BINARY_BUFFER = 3#
- KV_BLOB = 1#
[2, num_layers, num_tokens, hidden_dim]
- UNDEFINED = 0#
- class lmcache.experimental.memory_management.MemoryObj[source]#
MemoryObj interface.
- abstract get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- abstract property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
- class lmcache.experimental.memory_management.MemoryObjMetadata(shape: torch.Size, dtype: Optional[torch.dtype], address: int, phy_size: int, ref_count: int, fmt: lmcache.experimental.memory_management.MemoryFormat = <MemoryFormat.UNDEFINED: 0>)[source]#
- fmt: MemoryFormat = 0#
- class lmcache.experimental.memory_management.MixedMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates (1) memory in the pre-allocated pinned memory and (2) byte_array buffer memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.PinMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated pinned memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.TensorMemoryAllocator(tensor: Tensor)[source]#
Bases:
MemoryAllocatorInterface
Implements an “explicit list” memory allocator.
- ALIGN_BYTES = 512#
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) TensorMemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.TensorMemoryObj(raw_data: Tensor, metadata: MemoryObjMetadata)[source]#
Bases:
MemoryObj
Wraps a raw flat tensor with some metadata
- get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
lmcache.experimental.token_database module#
- class lmcache.experimental.token_database.ChunkedTokenDatabase(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata)[source]#
Bases:
TokenDatabase
- process_tokens(tokens: Tensor, mask: Tensor | None = None) Iterable[Tuple[int, int, CacheEngineKey]] [source]#
Process the tokens and return the corresponding cache engine keys.
- Parameters:
tokens (torch.Tensor) – The tokens to process, in 1-D CPU tensor.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
- Returns:
An iterable of tuples with three elements: the start index of the tokens for the key, the end index of the tokens for the key, and the cache engine key for the tokens.
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
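A sketch of consuming the (start, end, key) tuples yielded by process_tokens; the metadata argument (an LMCacheEngineMetadata) is left as a placeholder and the token values are illustrative.

```python
import torch
from lmcache.experimental.config import LMCacheEngineConfig
from lmcache.experimental.token_database import ChunkedTokenDatabase


def print_chunk_keys(metadata) -> None:
    """`metadata` is an LMCacheEngineMetadata describing the serving model."""
    config = LMCacheEngineConfig.from_defaults(chunk_size=256)
    db = ChunkedTokenDatabase(config, metadata)

    tokens = torch.randint(0, 32000, (1024,))  # 1-D CPU tensor of token ids

    # Each yielded tuple covers one chunk: tokens[start:end] hashes to `key`.
    for start, end, key in db.process_tokens(tokens):
        print(f"chunk [{start}:{end}] -> {key}")
```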
- class lmcache.experimental.token_database.TokenDatabase[source]#
TokenDatabase is used to convert input tokens into a list of cache engine keys. There are multiple ways to implement this:
- ChunkedTokenDatabase: processes tokens into chunks and converts each chunk into a cache engine key using a prefix hash.
- RadixTokenDatabase: a more advanced implementation using a radix tree.
- abstract process_tokens(tokens: Tensor, mask: Tensor | None = None) Iterable[Tuple[int, int, CacheEngineKey]] [source]#
Process the tokens and return the corresponding cache engine keys.
- Parameters:
tokens (torch.Tensor) – The tokens to process, in 1-D CPU tensor.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
- Returns:
An iterable of tuples with three elements: the start index of the tokens for the key, the end index of the tokens for the key, and the cache engine key for the tokens.