LMCache Storage Backend#

Subpackages#

Submodules#

lmcache.storage_backend.abstract_backend module#

class lmcache.storage_backend.abstract_backend.LMCBackendInterface(dst_device: str = 'cuda')[source]#
batched_get(keys: Iterable[CacheEngineKey]) Iterable[Tensor | None][source]#

Retrieve the kv cache chunks by the given keys in a batched manner

Parameters:

keys – an iterable of keys of the token chunks, each including prefix hash and format

Returns:

an iterable of the KV caches of the token chunks, each as a single big tensor, or None if the corresponding key is not found

batched_put(keys_and_chunks: Iterable[Tuple[CacheEngineKey, Tensor]], blocking=True) int[source]#

Store the multiple keys and KV cache chunks into the cache engine in a batched manner.

Parameters:
  • keys_and_chunks – an iterable of (key, KV chunk) pairs, where each key is a CacheEngineKey and each chunk is a single big tensor

  • blocking – whether to block until the operation completes

Returns:

the number of chunks stored
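
As an illustration of the batched API, the helper below stores a list of chunks and immediately reads them back. backend, keys, and chunks are assumed to come from elsewhere; constructing CacheEngineKey objects is outside the scope of this sketch:

from typing import List, Tuple

import torch


def roundtrip(backend, keys, chunks: List[torch.Tensor]) -> Tuple[int, int]:
    """Store chunks under keys, then read them back and count the hits."""
    stored = backend.batched_put(zip(keys, chunks), blocking=True)
    # batched_get yields one entry per key: a tensor on a hit, None on a miss.
    reloaded = list(backend.batched_get(keys))
    hits = sum(chunk is not None for chunk in reloaded)
    return stored, hits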

abstract close()[source]#

Do the cleanup. Child classes should override this method if necessary.

abstract contains(key: CacheEngineKey) bool[source]#

Query whether a key is in the cache.

abstract get(key: CacheEngineKey) Tensor | None[source]#

Retrieve the KV cache chunk by the given key

Parameters:

key – the key of the token chunk, including prefix hash and format

Returns:

the KV cache of the token chunk as a single big tensor, or None if the key is not found

abstract put(key: CacheEngineKey, kv_chunk: Tensor, blocking=True) None[source]#

Store the KV cache of the tokens into the cache engine.

Parameters:
  • key – the key of the token chunk, in the format of CacheEngineKey

  • kv_chunk – the kv cache of the token chunk, as a big tensor.

  • blocking – whether to block until the operation completes

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.
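
Only close(), contains(), get(), and put() are marked abstract; batched_get() and batched_put() are not, so subclasses appear to get default batched implementations built on the single-key methods. Purely as an illustration of the interface (not part of LMCache), a dict-backed subclass could look like the sketch below, assuming CacheEngineKey objects are hashable:

from typing import Optional

import torch

from lmcache.storage_backend.abstract_backend import LMCBackendInterface


class DictBackend(LMCBackendInterface):
    """Illustrative in-memory backend keyed by CacheEngineKey."""

    def __init__(self, dst_device: str = "cuda"):
        super().__init__(dst_device)
        self._device = dst_device
        self._store = {}

    def contains(self, key) -> bool:
        return key in self._store

    def put(self, key, kv_chunk: torch.Tensor, blocking=True) -> None:
        # Keep chunks on CPU so the compute device is not held by the cache.
        self._store[key] = kv_chunk.detach().cpu()

    def get(self, key) -> Optional[torch.Tensor]:
        chunk = self._store.get(key)
        return None if chunk is None else chunk.to(self._device)

    def close(self) -> None:
        self._store.clear()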

lmcache.storage_backend.hybrid_backend module#

class lmcache.storage_backend.hybrid_backend.LMCHybridBackend(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, mpool_metadata: LMCacheMemPoolMetadata, dst_device: str = 'cuda')[source]#

Bases: LMCBackendInterface

A hybrid backend that uses both a local and a remote backend to store and retrieve data. It implements write-through and read-through caching.
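
The read-through/write-through behaviour amounts to roughly the flow below. This is a simplified sketch of the pattern, not the actual implementation; local_store and remote_store stand in for the underlying local and remote backends:

def read_through_get(local_store, remote_store, key):
    chunk = local_store.get(key)
    if chunk is None:                       # local miss: fall back to remote
        chunk = remote_store.get(key)
        if chunk is not None:
            local_store.put(key, chunk)     # backfill the local tier
    return chunk


def write_through_put(local_store, remote_store, key, kv_chunk, blocking=True):
    local_store.put(key, kv_chunk, blocking=blocking)   # write both tiers
    remote_store.put(key, kv_chunk, blocking=blocking)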

batched_get(keys: Iterable[CacheEngineKey]) Iterable[Tensor | None][source]#

Retrieve the kv cache chunks by the given keys in a batched manner

Parameters:

keys – an iterable of keys of the token chunks, each including prefix hash and format

Returns:

an iterable of the KV caches of the token chunks, each as a single big tensor, or None if the corresponding key is not found

close()[source]#

Do the cleanup. Child classes should override this method if necessary.

contains(key: CacheEngineKey) bool[source]#

Query whether a key is in the cache.

get(key: CacheEngineKey) Tensor | None[source]#

Retrieve the KV cache chunk by the given key

Parameters:

key – the key of the token chunk, including prefix hash and format

Returns:

the KV cache of the token chunk as a single big tensor, or None if the key is not found

put(key: CacheEngineKey, value: Tensor, blocking: bool = True)[source]#

Store the KV cache of the tokens into the cache engine.

Parameters:
  • key – the key of the token chunk, in the format of CacheEngineKey

  • value – the KV cache of the token chunk, as a single big tensor

  • blocking – whether to block until the operation completes

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.

lmcache.storage_backend.local_backend module#

class lmcache.storage_backend.local_backend.LMCLocalBackend(config: LMCacheEngineConfig, metadata: LMCacheMemPoolMetadata, dst_device: str = 'cuda')[source]#

Bases: LMCBackendInterface

Cache engine for storing the KV cache of the tokens in local CPU/GPU memory.

close()[source]#

Do the cleanup. Child classes should override this method if necessary.

contains(key: CacheEngineKey) bool[source]#

Check if the cache engine contains the key.

Input:

key: the key of the token chunk, including prefix hash and format

Returns:

True if the cache engine contains the key, False otherwise

get(key: CacheEngineKey) Tensor | None[source]#

Retrieve the KV cache chunk by the given key

Input:

key: the key of the token chunk, including prefix hash and format

Output:

the KV cache of the token chunk as a single big tensor, or None if the key is not found

put(key: CacheEngineKey, kv_chunk: Tensor, blocking: bool = True) None[source]#

Store the KV cache of the tokens into the cache engine.

Input:

key: the key of the token chunk, including prefix hash and format

kv_chunk: the KV cache of the token chunk, as a single big tensor

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.

put_blocking(key, kv_chunk)[source]#
put_nonblocking(key, kv_chunk)[source]#
put_worker()[source]#
remove(key: CacheEngineKey) None[source]#

Remove the KV cache chunk by the given key

Input:

key: the key of the token chunk, including prefix hash and format
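
The put_blocking / put_nonblocking / put_worker methods, together with LocalBackendEndSignal below, suggest a producer-consumer design: non-blocking puts enqueue work for a background worker thread, and an end signal tells the worker to stop. A generic sketch of that pattern (not LMCache's code), using a plain queue.Queue:

import queue
import threading


class EndSignal:
    """Stands in for LocalBackendEndSignal in this sketch."""


def make_async_put(put_blocking):
    """Wrap a blocking put in a queue plus worker thread; returns (put, stop)."""
    work_queue: queue.Queue = queue.Queue()

    def put_worker():
        while True:
            item = work_queue.get()
            if isinstance(item, EndSignal):
                break                        # drain stops on the end signal
            key, kv_chunk = item
            put_blocking(key, kv_chunk)      # the slow, blocking store

    worker = threading.Thread(target=put_worker, daemon=True)
    worker.start()

    def put_nonblocking(key, kv_chunk):
        work_queue.put((key, kv_chunk))      # returns immediately

    def stop():
        work_queue.put(EndSignal())
        worker.join()

    return put_nonblocking, stop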

class lmcache.storage_backend.local_backend.LMCLocalDiskBackend(config: LMCacheEngineConfig, metadata: LMCacheMemPoolMetadata, dst_device: str = 'cuda')[source]#

Bases: LMCBackendInterface

Cache engine for storing the KV cache of the tokens on the local disk.

buffer_sweeper()[source]#

Sweep the future pool to free up memory.

close()[source]#

Do the cleanup. Child classes should override this method if necessary.

contains(key: CacheEngineKey) bool[source]#

Check if the cache engine contains the key.

Input:

key: the key of the token chunk, including prefix hash and format

Returns:

True if the cache engine contains the key, False otherwise

get(key: CacheEngineKey) Tuple[Tuple[Tensor, Tensor], ...] | None[source]#

Retrieve the KV cache chunk by the given key

Input:

key: the key of the token chunk, including prefix hash and format

Output:

the KV cache of the token chunk, in the format of nested tuples, or None if the key is not found

put(key: CacheEngineKey, kv_chunk: Tensor, blocking: bool = True) None[source]#

Store the KV cache of the tokens into the cache engine.

Input:

key: the key of the token chunk, including prefix hash and format

kv_chunk: the KV cache of the token chunk, as a single big tensor

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.

put_blocking(key: CacheEngineKey, kv_chunk: Tensor) None[source]#
put_nonblocking(key: CacheEngineKey, kv_chunk: Tensor) None[source]#
put_worker()[source]#
remove(key: CacheEngineKey) None[source]#

Remove the KV cache chunk by the given key

Input:

key: the key of the token chunk, including prefix hash and format

class lmcache.storage_backend.local_backend.LocalBackendEndSignal[source]#
lmcache.storage_backend.local_backend.save_disk(path: str, kv_chunk: Tensor)[source]#
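
save_disk persists a single chunk tensor at the given path. A minimal stand-in with the same shape, assuming a torch.save-based format (the actual on-disk format used by LMCache may differ):

import torch


def save_disk_sketch(path: str, kv_chunk: torch.Tensor) -> None:
    # Move to CPU before serializing so GPU-resident chunks can be written.
    torch.save(kv_chunk.detach().cpu(), path)


def load_disk_sketch(path: str) -> torch.Tensor:
    return torch.load(path, map_location="cpu")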

lmcache.storage_backend.remote_backend module#

class lmcache.storage_backend.remote_backend.LMCPipelinedRemoteBackend(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, dst_device: str = 'cuda')[source]#

Bases: LMCRemoteBackend

Implements the pipelined get functionality for the remote backend.

batched_get(keys: Iterator[CacheEngineKey]) Iterable[Tensor | None][source]#

Retrieve the kv cache chunks by the given keys in a batched manner

Parameters:

keys – the iterator of keys of the token chunks, including prefix hash and format

Returns:

an iterable of the KV caches of the token chunks, each as a single big tensor, or None if the corresponding key is not found

close()[source]#

Do the cleanup. Child classes should override this method if necessary.

deserialize_worker()[source]#
network_worker()[source]#
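
network_worker and deserialize_worker point to a two-stage pipeline: one stage fetches serialized chunks over the network while the other deserializes previously fetched ones, so network and CPU costs overlap. A generic sketch of that pipeline (not LMCache's implementation), with fetch and deserialize supplied by the caller:

import queue
import threading
from typing import Callable, Iterable, List, Optional

_END = object()   # sentinel marking the end of the fetch stream


def pipelined_get(keys: Iterable,
                  fetch: Callable,         # key -> serialized bytes, or None
                  deserialize: Callable    # bytes -> tensor
                  ) -> List[Optional[object]]:
    key_list = list(keys)
    fetched: queue.Queue = queue.Queue()

    def network_stage():
        for key in key_list:
            fetched.put(fetch(key))          # overlaps with deserialization
        fetched.put(_END)

    threading.Thread(target=network_stage, daemon=True).start()

    results: List[Optional[object]] = []
    while True:
        data = fetched.get()
        if data is _END:
            break
        results.append(None if data is None else deserialize(data))
    return results
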
class lmcache.storage_backend.remote_backend.LMCRemoteBackend(config: LMCacheEngineConfig, metadata: LMCacheEngineMetadata, dst_device: str = 'cuda')[source]#

Bases: LMCBackendInterface

Cache engine for storing the KV cache of the tokens in the remote server.

close()[source]#

Do the cleanup. Child classes should override this method if necessary.

contains(key: CacheEngineKey) bool[source]#

Check if the cache engine contains the key.

Input:

key: the key of the token chunk, including prefix hash and format

Returns:

True if the cache engine contains the key, False otherwise

get(key: CacheEngineKey) Tensor | None[source]#

Retrieve the KV cache chunk (in a single big tensor) by the given key

list() List[CacheEngineKey][source]#

List the remote keys (and also update the cached set of existing keys).
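
list() can be used, for example, to decide which chunks still need to be uploaded. In the sketch below, remote_backend is an already-constructed LMCRemoteBackend (construction is omitted) and CacheEngineKey is assumed to be hashable:

def keys_missing_remotely(remote_backend, candidate_keys):
    # Keys currently on the remote server (this also refreshes the cached set).
    existing = set(remote_backend.list())
    return [key for key in candidate_keys if key not in existing]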

put(key: CacheEngineKey, kv_chunk: Tensor, blocking: bool = True) None[source]#

Store the KV cache of the tokens into the cache engine.

Input:

key: the key of the token chunk, including prefix hash and format

kv_chunk: the kv cache of the token chunk, in a single big tensor

blocking: whether to block until the put is done

Returns:

None

Note

The KV cache should NOT have the “batch” dimension.

put_blocking(key: CacheEngineKey, kv_chunk: Tensor) None[source]#
put_worker()[source]#
class lmcache.storage_backend.remote_backend.RemoteBackendEndSignal[source]#