LMCache Engine Interface

class LMCacheEngine(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, memory_allocator: MemoryAllocatorInterface, token_database: TokenDatabase, gpu_connector: GPUConnectorInterface)

The main class for the cache engine.

When storing the KV caches into the cache engine, it takes GPU KV caches from the serving engine and converts them into MemoryObjs that reside in CPU memory. The MemoryObjs are then stored into the StorageBackends asynchronously.

When retrieving the KV caches from the cache engine, it fetches the MemoryObjs from the StorageBackends and converts them into GPU KV caches using GPUConnectors specialized for the serving engine.

It also supports prefetching the KV caches from the StorageBackends. It relies on the StorageBackends to manage both prefetching and actual retrieval requests and to avoid conflicts between them.
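
A minimal end-to-end sketch is shown below. It assumes an already-constructed engine (building the config, metadata, memory allocator, token database, and GPU connector is deployment specific and omitted), and kv_kwargs stands in for whatever KV-cache-specific arguments the configured GPU connector expects; the token ids are illustrative.

    import torch

    def store_then_retrieve(engine, kv_kwargs: dict) -> None:
        # engine: an initialized LMCacheEngine (assumed); kv_kwargs: the
        # serving-engine-specific KV information (e.g. paged KV buffer and
        # page tables) expected by the GPU connector.
        tokens = torch.tensor([101, 2009, 2003, 1037, 3231, 102])  # shape [seq_len]

        engine.store(tokens, **kv_kwargs)        # offload GPU KV caches (asynchronous)
        num_cached = engine.lookup(tokens)       # number of cached prefix tokens
        ret_mask = engine.retrieve(tokens, **kv_kwargs)  # load KV back to the GPU
        print(f"cached prefix: {num_cached}, retrieved: {int(ret_mask.sum())}")

        engine.close()                           # free all resources when done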

close() → None

Close the cache engine and free all resources.

lookup(tokens: Tensor, search_range: List[str] | None = None) → int

Checks the existence of the KV caches of the given tokens in the cache engine.

Parameters:
  • tokens – the input tokens, with shape [seq_len]

  • search_range (Optional[List[str]]) – The range of storage backends to search in. Should be a subset of ["Hot", "LocalDiskBackend"] for now. If None, search in all backends.

Returns:

An int indicating how many prefix tokens are cached.
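
A minimal sketch of lookup usage, assuming an initialized engine as above; the backend names follow the subset listed in the parameter description.

    import torch

    def cached_prefix_stats(engine, tokens: torch.Tensor) -> int:
        total = engine.lookup(tokens)                      # search all backends
        hot = engine.lookup(tokens, search_range=["Hot"])  # CPU-memory tier only
        # lookup returns a prefix length: tokens[:total] have cached KV.
        print(f"{hot}/{total} cached prefix tokens are already in CPU memory")
        return total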

prefetch(tokens: Tensor, mask: Tensor | None = None) → None

Launch the prefetching process in the storage manager to load the KV caches into local CPU memory.
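
A hedged sketch of the typical prefetch-then-retrieve pattern, reusing the assumed engine and kv_kwargs from the earlier sketch; prefetch returns immediately and the loading proceeds in the background.

    import torch

    def warm_then_retrieve(engine, tokens: torch.Tensor, kv_kwargs: dict):
        # Kick off asynchronous loading from the storage backends into CPU memory.
        engine.prefetch(tokens)

        # ... overlap other work here; the storage manager coordinates the
        # prefetch with any concurrent retrieval to avoid conflicts ...

        # This retrieve can now be served from local CPU memory.
        return engine.retrieve(tokens, **kv_kwargs)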

retrieve(tokens: Tensor, mask: Tensor | None = None, **kwargs) → Tensor

Retrieve the KV caches from the cache engine and feed the retrieved KV caches into the serving engine via the GPU connector.

Parameters:
  • tokens (torch.Tensor) – The tokens of the corresponding KV caches.

  • mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens, and should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched and the False values ALWAYS form the PREFIX of the tensor.

  • **kwargs

    The additional arguments for the storage backend, which will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).

Returns:

A boolean mask indicating which tokens were retrieved. The mask has the same length as tokens and resides on the CPU.

Raises:

ValueError – if the number of False values in the mask is not a multiple of the chunk size.
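
A hedged sketch of the mask convention described above, again assuming an initialized engine and connector-specific kv_kwargs; num_skip is an illustrative stand-in for the already-computed prefix length.

    import torch

    def retrieve_suffix(engine, tokens: torch.Tensor, num_skip: int, kv_kwargs: dict) -> int:
        # num_skip must be a multiple of the engine's chunk size, otherwise
        # retrieve raises ValueError.
        mask = torch.zeros(len(tokens), dtype=torch.bool)
        mask[num_skip:] = True  # FFFF...TTTT: False prefix, True suffix

        ret_mask = engine.retrieve(tokens, mask=mask, **kv_kwargs)
        # ret_mask is a CPU bool tensor of the same length as tokens; True
        # marks positions whose KV caches were actually retrieved.
        return int(ret_mask.sum())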

store(tokens: Tensor, mask: Tensor | None = None, **kwargs) → None

Store the KV caches of the given tokens into the cache engine.

Parameters:
  • tokens (torch.Tensor) – The tokens of the corresponding KV caches.

  • mask (Optional[torch.Tensor]) – The mask for the tokens. Should have the same length as tokens, and should ALWAYS have the form FFFFFTTTTTTT, where True means the token needs to be matched and the False values ALWAYS form the PREFIX of the tensor.

  • **kwargs

    The additional arguments for the storage backend, which will be passed into the gpu_connector. Should include KV-cache-specific information (e.g., the paged KV buffer and the page tables).

Raises:

ValueError – if the number of False values in the mask is not a multiple of the chunk size.
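
A matching sketch for store, under the same assumptions (engine, kv_kwargs); num_stored is an illustrative prefix length for tokens whose KV caches are already in the engine.

    import torch

    def store_new_suffix(engine, tokens: torch.Tensor, num_stored: int, kv_kwargs: dict) -> None:
        # Skip the first num_stored tokens and store the KV caches of the
        # rest; num_stored must be a multiple of the chunk size, otherwise
        # store raises ValueError.
        mask = torch.zeros(len(tokens), dtype=torch.bool)
        mask[num_stored:] = True

        engine.store(tokens, mask=mask, **kv_kwargs)  # storage is asynchronous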