LMCache Backend#

class lmcache.storage_backend.abstract_backend.LMCBackendInterface(dst_device: str = 'cuda')[source]#
batched_get(keys: Iterable[CacheEngineKey]) → Iterable[Tensor | None][source]#

Retrieve the KV cache chunks for the given keys in a batched manner.

Parameters:

keys – an iterable of keys of the token chunks, each key including the prefix hash and format

Returns:

an iterable of KV caches of the token chunks, each a single large tensor, with None in place of any key that is not found
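A minimal usage sketch (illustrative: `backend` stands for some concrete subclass instance, `keys` for a list of CacheEngineKey objects, and the returned iterable is assumed to align positionally with the input keys):

# Illustrative sketch: `backend` is a concrete LMCBackendInterface
# implementation and `keys` is a list of CacheEngineKey objects.
results = backend.batched_get(keys)

# Assumption: results align positionally with the input keys; a miss yields None.
for key, chunk in zip(keys, results):
    if chunk is None:
        print(f"miss: {key}")
    else:
        print(f"hit: {key} -> tensor of shape {tuple(chunk.shape)}")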

batched_put(keys_and_chunks: Iterable[Tuple[CacheEngineKey, Tensor]], blocking=True) → int[source]#

Store multiple keys and KV cache chunks into the cache engine in a batched manner.

Parameters:
  • keys_and_chunks – an iterable of (key, kv_chunk) pairs, where each key is a CacheEngineKey and each kv_chunk is the KV cache of the corresponding token chunk as a single large tensor

  • blocking – whether to block until the operation completes

Returns:

the number of chunks stored
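A batched store sketch (illustrative names; keys are paired with their chunk tensors to match the signature above):

# Illustrative sketch: `keys` is a list of CacheEngineKey objects and
# `kv_chunks` is a list of corresponding KV chunk tensors.
keys_and_chunks = list(zip(keys, kv_chunks))

# With blocking=True, the call returns only once the chunks are stored
# and reports how many chunks were stored.
num_stored = backend.batched_put(keys_and_chunks, blocking=True)
print(f"stored {num_stored} of {len(keys_and_chunks)} chunks")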

abstract close()[source]#

Perform any necessary cleanup. Subclasses should override this method if needed.

abstract contains(key: CacheEngineKey) → bool[source]#

Check whether the given key is present in the cache.

abstract get(key: CacheEngineKey) → Tensor | None[source]#

Retrieve the KV cache chunk for the given key.

Parameters:

key – the key of the token chunk, including the prefix hash and format

Returns:

the KV cache of the token chunk as a single large tensor, or None if the key is not found
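A short lookup sketch (illustrative; get() signals a miss by returning None, so a separate contains() check is optional):

# Illustrative sketch: fetch a chunk and fall back on a miss.
kv_chunk = backend.get(key)
if kv_chunk is None:
    # Miss: the KV cache for this token chunk must be recomputed.
    ...
else:
    # Hit: kv_chunk is a single large tensor (presumably on dst_device).
    ...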

abstract put(key: CacheEngineKey, kv_chunk: Tensor, blocking=True) → None[source]#

Store the KV cache of the token chunk into the cache engine.

Parameters:
  • key – the key of the token chunk, in the format of CacheEngineKey

  • kv_chunk – the KV cache of the token chunk, as a single large tensor

  • blocking – whether to block until the operation completes

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.
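To make the contract concrete, below is a minimal in-memory subclass sketch, for illustration only (not part of LMCache): it keeps chunks in a plain dict and implements just the four abstract methods, inheriting the default batched_get()/batched_put() from the base class. The CacheEngineKey import path is an assumption; adjust it to your LMCache version.

from typing import Optional

import torch

from lmcache.storage_backend.abstract_backend import LMCBackendInterface
from lmcache.utils import CacheEngineKey  # assumed import path


class InMemoryBackend(LMCBackendInterface):
    """Illustrative dict-backed backend; for demonstration only."""

    def __init__(self, dst_device: str = "cuda"):
        super().__init__(dst_device)
        self._dst_device = dst_device
        self._store = {}  # CacheEngineKey -> torch.Tensor

    def contains(self, key: CacheEngineKey) -> bool:
        return key in self._store

    def put(self, key: CacheEngineKey, kv_chunk: torch.Tensor, blocking=True) -> None:
        # Per the note above, kv_chunk must NOT carry a "batch" dimension.
        self._store[key] = kv_chunk

    def get(self, key: CacheEngineKey) -> Optional[torch.Tensor]:
        chunk = self._store.get(key)
        # Move retrieved chunks to the destination device on the way out.
        return chunk.to(self._dst_device) if chunk is not None else None

    def close(self) -> None:
        self._store.clear()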