LMCache Experimental#
Subpackages#
- lmcache.experimental.storage_backend package
- Subpackages
- Submodules
- lmcache.experimental.storage_backend.abstract_backend module
- lmcache.experimental.storage_backend.local_disk_backend module
- LocalDiskBackend
- LocalDiskBackend.async_load_bytes_from_disk()
- LocalDiskBackend.async_save_bytes_to_disk()
- LocalDiskBackend.close()
- LocalDiskBackend.contains()
- LocalDiskBackend.exists_in_put_tasks()
- LocalDiskBackend.get_blocking()
- LocalDiskBackend.insert_key()
- LocalDiskBackend.load_bytes_from_disk()
- LocalDiskBackend.load_disk()
- LocalDiskBackend.remove()
- LocalDiskBackend.submit_prefetch_task()
- LocalDiskBackend.submit_put_task()
- lmcache.experimental.storage_backend.storage_manager module
- Module contents
Submodules#
lmcache.experimental.cache_engine module#
- class lmcache.experimental.cache_engine.LMCacheEngine(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, memory_allocator: MemoryAllocatorInterface, token_database: TokenDatabase, gpu_connector: GPUConnectorInterface)[source]#
The main class for the cache engine.
When storing KV caches into the cache engine, it takes the GPU KV caches from the serving engine and converts them into MemoryObjs that reside in CPU memory. The MemoryObjs are then stored into the StorageBackends asynchronously.
When retrieving KV caches from the cache engine, it fetches the MemoryObjs from the StorageBackends and converts them back into GPU KV caches using GPUConnectors specialized for the serving engine.
It also supports prefetching KV caches from the StorageBackends; it relies on the StorageBackends to manage prefetching and retrieval requests and to avoid conflicts between them. A usage sketch follows the method listing below.
- lookup(tokens: Tensor, search_range: List[str] | None = None) int [source]#
Checks the existence of the KV caches for the given tokens in the cache engine.
- Parameters:
tokens – the input tokens, with shape [seq_len]
search_range (Optional[List[str]]) – The range of storage backends
to search in. Should be a subset of [“Hot”, “LocalDiskBackend”] for now. If None, search in all backends.
- Returns:
An int indicating how many prefix tokens are cached.
- prefetch(tokens: Tensor, mask: Tensor | None = None) None [source]#
Launches the prefetching process in the storage manager to load the KV caches into local CPU memory.
- retrieve(tokens: Tensor, mask: Tensor | None = None, **kwargs) Tensor [source]#
Retrieves the KV caches from the cache engine and puts them into the serving engine via the GPU connector.
- Parameters:
tokens (torch.Tensor) – The tokens of the corresponding KV caches.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
**kwargs –
Additional arguments for the storage backend that will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).
- Returns:
A boolean mask (on CPU) indicating which tokens were retrieved; its length is the same as that of tokens.
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
- store(tokens: Tensor, mask: Tensor | None = None, **kwargs) None [source]#
Stores the KV caches for the given tokens into the cache engine.
- Parameters:
tokens (torch.Tensor) – The tokens of the corresponding KV caches.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
**kwargs –
Additional arguments for the storage backend that will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
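The methods above compose into a simple store-lookup-retrieve flow. The following is a minimal, illustrative sketch only: it assumes an already-built LMCacheEngine (see LMCacheEngineBuilder below) and leaves the connector-specific kwargs (here kvcaches) as a placeholder, since their exact form depends on the GPUConnector in use.

```python
import torch
from typing import Any


def store_and_retrieve(engine, kvcaches: Any, tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative only: `engine` is an already-built LMCacheEngine and
    `kvcaches` is whatever the engine's GPUConnector expects (e.g. a nested
    tuple of K/V tensors, or a paged KV buffer plus slot mapping)."""
    seq_len = tokens.shape[0]

    # Mask convention from the docs: all Falses form a prefix (FFFF...TTTT),
    # and the number of Falses must be a multiple of the chunk size.
    mask = torch.ones(seq_len, dtype=torch.bool)

    # Store the GPU KV caches of the masked tokens into the cache engine.
    engine.store(tokens, mask=mask, kvcaches=kvcaches)

    # How many prefix tokens are already cached (across all backends)?
    num_cached = engine.lookup(tokens)
    print(f"{num_cached} prefix tokens are cached")

    # Retrieve them back into the serving engine; the returned CPU boolean
    # mask tells which tokens were actually loaded.
    ret_mask = engine.retrieve(tokens, mask=mask, kvcaches=kvcaches)
    return ret_mask
```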
- class lmcache.experimental.cache_engine.LMCacheEngineBuilder[source]#
- classmethod destroy(instance_id: str) None [source]#
Closes and deletes the LMCacheEngine instance associated with the given instance ID.
- classmethod get(instance_id: str) LMCacheEngine | None [source]#
Returns the LMCacheEngine instance associated with the instance ID, or None if not found.
- classmethod get_or_create(instance_id: str, config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, gpu_connector: GPUConnectorInterface) LMCacheEngine [source]#
Builds a new LMCacheEngine instance if it doesn’t already exist for the given ID.
- Raises:
ValueError – If the instance already exists with a different configuration.
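A sketch of the builder lifecycle, under the assumption that metadata (an LMCacheEngineMetadata) and gpu_connector (a GPUConnectorInterface) have already been constructed to match the serving engine; the instance ID and config values are illustrative.

```python
from lmcache.experimental.cache_engine import LMCacheEngine, LMCacheEngineBuilder
from lmcache.experimental.config import LMCacheEngineConfig


def build_engine(metadata, gpu_connector) -> LMCacheEngine:
    """`metadata` and `gpu_connector` are placeholders supplied by the caller."""
    config = LMCacheEngineConfig.from_defaults(chunk_size=256, local_cpu=True)

    # Creates the engine on first call; later calls with the same ID return
    # the existing instance (a different config raises ValueError).
    engine = LMCacheEngineBuilder.get_or_create(
        "engine-0", config, metadata, gpu_connector
    )

    # Lookup by ID returns the same instance, or None if it was never built.
    assert LMCacheEngineBuilder.get("engine-0") is engine
    return engine


# On shutdown: closes the engine and removes it from the registry.
# LMCacheEngineBuilder.destroy("engine-0")
```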
lmcache.experimental.config module#
- class lmcache.experimental.config.LMCacheEngineConfig(chunk_size: int, local_cpu: bool, max_local_cpu_size: float, local_disk: str | None, max_local_disk_size: float, remote_url: str | None, remote_serde: str | None, save_decode_cache: bool, enable_blending: bool, blend_recompute_ratio: float, blend_min_tokens: int, enable_p2p: bool, lookup_url: str | None, distributed_url: str | None, error_handling: bool)[source]#
- static from_defaults(chunk_size: int = 256, local_cpu: bool = True, max_local_cpu_size: float = 5.0, local_disk: str | None = None, max_local_disk_size: int = 0, remote_url: str | None = 'lm://localhost:65432', remote_serde: str | None = 'naive', save_decode_cache: bool = False, enable_blending: bool = False, blend_recompute_ratio: float = 0.15, blend_min_tokens: int = 256, enable_p2p: bool = False, lookup_url: str | None = None, distributed_url: str | None = None, error_handling: bool = False) LMCacheEngineConfig [source]#
- static from_env() LMCacheEngineConfig [source]#
Loads the config from environment variables. It first creates a config via from_defaults and then overwrites the configuration values with those found in the environment. The environment variables should start with LMCACHE and be in uppercase, e.g., LMCACHE_CHUNK_SIZE.
Note: the default configuration only uses the CPU.
- static from_file(file_path: str) LMCacheEngineConfig [source]#
Loads the config from a YAML file.
- static from_legacy(chunk_size: int = 256, backend: str = 'cpu', remote_url: str | None = 'lm://localhost:65432', remote_serde: str = 'naive', save_decode_cache: bool = False, enable_blending: bool = False, blend_recompute_ratio: float = 0.15, blend_min_tokens: int = 256, max_local_disk_size: float = 0.0, enable_p2p: bool = False, lookup_url: str | None = None, distributed_url: str | None = None, error_handling: bool = False) LMCacheEngineConfig [source]#
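A sketch of the three construction paths, using the field names from the signature above; the environment-variable value, file path, and size unit are illustrative assumptions.

```python
import os
from lmcache.experimental.config import LMCacheEngineConfig

# 1. From explicit defaults, overriding only selected fields.
config = LMCacheEngineConfig.from_defaults(
    chunk_size=256,
    local_cpu=True,
    max_local_cpu_size=5.0,   # size of the local CPU cache (unit assumed to be GB)
    local_disk=None,
    remote_url=None,
)

# 2. From environment variables: upper-case names prefixed with LMCACHE,
#    e.g. LMCACHE_CHUNK_SIZE; unset values fall back to the defaults.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
config = LMCacheEngineConfig.from_env()

# 3. From a YAML file (path is illustrative).
# config = LMCacheEngineConfig.from_file("lmcache_config.yaml")
```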
lmcache.experimental.gpu_connector module#
- class lmcache.experimental.gpu_connector.GPUConnectorInterface[source]#
- abstract from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Loads the data from a GPU buffer into the memory object. Subclasses should define the format of the kwargs.
- abstract get_shape(num_tokens: int) Size [source]#
Get the shape of the data given the number of tokens.
- class lmcache.experimental.gpu_connector.VLLMNestedTupleGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_tokens, …]
The token dimension is specified by token_dim when constructing the connector.
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
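The layout this connector works with can be sketched as plain tensors. The sizes and dtype below are illustrative, and the interpretation of start/end as a token range within the whole sequence is inferred from the interface.

```python
from typing import Tuple
import torch

# Illustrative shape parameters only.
num_layers, num_tokens, hidden_dim = 4, 128, 1024

KVLayer = Tuple[torch.Tensor, torch.Tensor]

# GPUTensor = Tuple[KVLayer, ...]; each layer holds a (K, V) pair whose
# first dimension is the token dimension: [num_tokens, ...].
kvcaches: Tuple[KVLayer, ...] = tuple(
    (
        torch.zeros(num_tokens, hidden_dim),  # K for this layer
        torch.zeros(num_tokens, hidden_dim),  # V for this layer
    )
    for _ in range(num_layers)
)

# The connector consumes/produces this structure through the `kvcaches` kwarg:
#   connector.to_gpu(memory_obj, start, end, kvcaches=kvcaches)
#   connector.from_gpu(memory_obj, start, end, kvcaches=kvcaches)
```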
- class lmcache.experimental.gpu_connector.VLLMPagedMemGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- class lmcache.experimental.gpu_connector.VLLMPagedMemGPUConnectorV2(hidden_dim_size: int, num_layers: int, use_gpu: bool = False, **kwargs)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically, we have:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It will produce/consume memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Will set the memory_obj.metadata.fmt to MemoryFormat.KV_BLOB.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length is the same as that of the whole token sequence.
When prefix caching is used, slot_mapping starts with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the cached prefix (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length is the same as that of the whole token sequence.
When prefix caching is used, slot_mapping starts with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the cached prefix (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
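A sketch of the “full slot mapping” convention described in the notes above: the mapping covers the whole token sequence, and with prefix caching its leading entries are -1 up to the end of the matched prefix. The concrete slot values depend on vLLM's paged layout and are illustrative here.

```python
import torch

# Illustrative: a 12-token sequence whose first 4 tokens hit the prefix
# cache, so their slots are marked -1 and must never be written.
seq_len, prefix_hit = 12, 4

slot_mapping = torch.full((seq_len,), -1, dtype=torch.long)
slot_mapping[prefix_hit:] = torch.arange(100, 100 + seq_len - prefix_hit)

# When calling the connector, [start, end) must lie entirely past the cached
# prefix so the underlying CUDA kernel never sees a -1 slot:
#   connector.to_gpu(memory_obj, start=4, end=12,
#                    kvcaches=kvcaches, slot_mapping=slot_mapping)
```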
lmcache.experimental.memory_management module#
- class lmcache.experimental.memory_management.BufferAllocator(device='cpu')[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated pinned memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.BINARY_BUFFER) BytesBufferMemoryObj [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.BytesBufferMemoryObj(raw_bytes: bytes, metadata: MemoryObjMetadata | None = None)[source]#
Bases:
MemoryObj
Wraps a raw byte buffer with some metadata.
- get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
- class lmcache.experimental.memory_management.FreeBlock(start: int, size: int)[source]#
Metadata class used by the memory allocators
- class lmcache.experimental.memory_management.GPUMemoryAllocator(size: int, device='cuda')[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated GPU (device) memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.HostMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated Host memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.MemoryAllocatorInterface[source]#
- abstract allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.UNDEFINED) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- abstract free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- abstract get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
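Since allocate returns None on failure, callers are expected to check the result before using it. Below is a minimal sketch using one of the concrete allocators documented later in this module (PinMemoryAllocator); the pool size, shape, and dtype are illustrative assumptions.

```python
import torch
from lmcache.experimental.memory_management import MemoryFormat, PinMemoryAllocator

# Illustrative: a 256 MiB pinned-memory pool.
allocator = PinMemoryAllocator(size=1 << 28)

memory_obj = allocator.allocate(
    shape=torch.Size([2, 4, 256, 1024]),  # KV_BLOB: [2, num_layers, num_tokens, hidden_dim]
    dtype=torch.bfloat16,
    fmt=MemoryFormat.KV_BLOB,
)

if memory_obj is None:
    # Allocation can fail (e.g. the pool is exhausted); handle it explicitly.
    raise RuntimeError("pinned-memory allocation failed")

print(allocator.get_ref_count(memory_obj))

# Per the docstrings above, avoid calling allocator.free() directly; release
# the object by decreasing its reference count (ref_count_down) instead.
```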
- class lmcache.experimental.memory_management.MemoryFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
Enum
- BINARY = 2#
Compressed binary array format
- BINARY_BUFFER = 3#
- KV_BLOB = 1#
[2, num_layers, num_tokens, hidden_dim]
- UNDEFINED = 0#
- class lmcache.experimental.memory_management.MemoryObj[source]#
MemoryObj interface.
- abstract get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- abstract property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
- class lmcache.experimental.memory_management.MemoryObjMetadata(shape: torch.Size, dtype: Optional[torch.dtype], address: int, phy_size: int, ref_count: int, fmt: lmcache.experimental.memory_management.MemoryFormat = <MemoryFormat.UNDEFINED: 0>)[source]#
- fmt: MemoryFormat = 0#
- class lmcache.experimental.memory_management.MixedMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates (1) memory in the pre-allocated pinned memory and (2) byte_array buffer memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.PinMemoryAllocator(size: int)[source]#
Bases:
MemoryAllocatorInterface
Allocates memory in the pre-allocated pinned memory.
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) MemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.TensorMemoryAllocator(tensor: Tensor)[source]#
Bases:
MemoryAllocatorInterface
Implements an “explicit list” memory allocator.
- ALIGN_BYTES = 512#
- allocate(shape: Size | Tuple[int, ...], dtype: dtype | None, fmt: MemoryFormat = MemoryFormat.KV_BLOB) TensorMemoryObj | None [source]#
Allocates the memory to hold a tensor of the given shape.
- Parameters:
shape (torch.Size) – The shape of the tensor to allocate.
dtype (torch.dtype) – The dtype of the tensor to allocate.
fmt (MemoryFormat) – The format of the memory to allocate.
- Returns:
A MemoryObj wrapping the allocated memory. Returns None if the allocation failed.
- Return type:
Optional[MemoryObj]
- free(memory_obj: MemoryObj)[source]#
Frees the memory allocated for the given MemoryObj. Note that this function shouldn’t be explicitly called. Instead, use ref_count_down to decrease ref count.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to free.
- get_ref_count(memory_obj: MemoryObj)[source]#
Get ref count for the given MemoryObj.
- Parameters:
memory_obj (MemoryObj) – The MemoryObj to query.
- class lmcache.experimental.memory_management.TensorMemoryObj(raw_data: Tensor, metadata: MemoryObjMetadata)[source]#
Bases:
MemoryObj
Wraps a raw flat tensor with some metadata
- get_memory_format() MemoryFormat [source]#
Get the memory format of the MemoryObj.
- property metadata: MemoryObjMetadata#
Get the metadata of the MemoryObj.
lmcache.experimental.token_database module#
- class lmcache.experimental.token_database.ChunkedTokenDatabase(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata)[source]#
Bases:
TokenDatabase
- process_tokens(tokens: Tensor, mask: Tensor | None = None) Iterable[Tuple[int, int, CacheEngineKey]] [source]#
Process the tokens and return the corresponding cache engine keys.
- Parameters:
tokens (torch.Tensor) – The tokens to process, in 1-D CPU tensor.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
- Returns:
An iterable of tuples with three elements: the start index of the tokens for the key, the end index of the tokens for the key, and the cache engine key for the tokens.
- Raises:
ValueError – If the number of Falses in the mask is not a multiple of the chunk size.
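A sketch of consuming the (start, end, key) tuples yielded by process_tokens; the metadata argument (an LMCacheEngineMetadata) is left as a placeholder and the token values are illustrative.

```python
import torch
from lmcache.experimental.config import LMCacheEngineConfig
from lmcache.experimental.token_database import ChunkedTokenDatabase


def print_chunk_keys(metadata) -> None:
    """`metadata` is an LMCacheEngineMetadata describing the serving model."""
    config = LMCacheEngineConfig.from_defaults(chunk_size=256)
    db = ChunkedTokenDatabase(config, metadata)

    tokens = torch.randint(0, 32000, (1024,))  # 1-D CPU tensor of token ids

    # Each yielded tuple covers one chunk: tokens[start:end] hashes to `key`.
    for start, end, key in db.process_tokens(tokens):
        print(f"chunk [{start}:{end}] -> {key}")
```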
- class lmcache.experimental.token_database.TokenDatabase[source]#
TokenDatabase is used to convert input tokens into a list of cache engine keys. There are multiple ways to implement this:
- ChunkedTokenDatabase: processes tokens into chunks and converts each chunk into a cache engine key using a prefix hash.
- RadixTokenDatabase: a more advanced implementation using a radix tree.
- abstract process_tokens(tokens: Tensor, mask: Tensor | None = None) Iterable[Tuple[int, int, CacheEngineKey]] [source]#
Process the tokens and return the corresponding cache engine keys.
- Parameters:
tokens (torch.Tensor) – The tokens to process, in 1-D CPU tensor.
mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens. The mask should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched, and all the Falses are at the PREFIX of the tensor.
- Returns:
An iterable of tuples with three elements: the start index of the tokens for the key, the end index of the tokens for the key, and the cache engine key for the tokens.