LMCache GPU Connector Interface#
- class GPUConnectorInterface[source]#
- abstract from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Load data from a GPU buffer into the memory object. Subclasses should define the format of the kwargs.
- abstract get_shape(num_tokens: int) Size [source]#
Get the shape of the data given the number of tokens.
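The contract above can be sketched with a simplified, self-contained stand-in (plain Python, with a tuple in place of torch.Size and a hypothetical MemoryObjStub; the real classes live in lmcache and carry torch tensors):

```python
from abc import ABC, abstractmethod

class MemoryObjStub:
    """Hypothetical stand-in for lmcache's MemoryObj: holds a flat buffer."""
    def __init__(self, data):
        self.data = data  # a nested list standing in for a tensor

class GPUConnectorSketch(ABC):
    """Mirrors the GPUConnectorInterface contract described above."""

    @abstractmethod
    def from_gpu(self, memory_obj, start: int, end: int, **kwargs):
        """Copy tokens [start, end) from a GPU buffer into memory_obj.
        Subclasses define which kwargs (e.g. 'kvcaches') they require."""

    @abstractmethod
    def to_gpu(self, memory_obj, start: int, end: int, **kwargs):
        """Copy tokens [start, end) from memory_obj back to the GPU buffer."""

    @abstractmethod
    def get_shape(self, num_tokens: int) -> tuple:
        """Shape of the stored data for num_tokens tokens."""

class FlatConnector(GPUConnectorSketch):
    """Toy connector: stores one hidden_dim-sized vector per token."""
    def __init__(self, hidden_dim: int):
        self.hidden_dim = hidden_dim

    def from_gpu(self, memory_obj, start, end, **kwargs):
        buf = kwargs["gpu_buffer"]  # this subclass's required kwarg
        memory_obj.data = [row[:] for row in buf[start:end]]

    def to_gpu(self, memory_obj, start, end, **kwargs):
        buf = kwargs["gpu_buffer"]
        buf[start:end] = [row[:] for row in memory_obj.data]

    def get_shape(self, num_tokens: int) -> tuple:
        return (num_tokens, self.hidden_dim)
```

Note how the start/end token range and the subclass-specific kwarg mirror the signatures documented above.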
- class VLLMNestedTupleGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_tokens, …]
The token dimension is specified by token_dim when constructing the connector.
It produces and consumes memory objects in KV_BLOB format.
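The nested-tuple layout can be illustrated with plain Python lists standing in for tensors (sizes and names here are illustrative, not lmcache's actual objects):

```python
num_layers, num_tokens, hidden = 2, 4, 8

def zeros(shape):
    """Build a nested list of the given shape (toy tensor stand-in)."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]

# GPUTensor = Tuple[KVLayer, ...]; KVLayer = Tuple[K, V];
# each tensor's first dimension is the token dimension.
kvcaches = tuple(
    (zeros((num_tokens, hidden)), zeros((num_tokens, hidden)))
    for _ in range(num_layers)
)

k_layer0 = kvcaches[0][0]   # K tensor of layer 0
v_layer1 = kvcaches[1][1]   # V tensor of layer 1
```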
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
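A toy version of the from_gpu/to_gpu round trip over a token range can look like this (the gather/scatter order here is an assumption for illustration; the real KV_BLOB layout is defined by LMCache):

```python
def from_gpu_sketch(kvcaches, start, end):
    """Gather tokens [start, end) of every layer's K and V into one blob."""
    blob = []
    for k, v in kvcaches:  # one KVLayer = (K, V) per layer
        blob.append(([row[:] for row in k[start:end]],
                     [row[:] for row in v[start:end]]))
    return blob

def to_gpu_sketch(blob, kvcaches, start, end):
    """Scatter the blob back into tokens [start, end) of the caches."""
    for (k_chunk, v_chunk), (k, v) in zip(blob, kvcaches):
        k[start:end] = [row[:] for row in k_chunk]
        v[start:end] = [row[:] for row in v_chunk]

# Round-trip demo: two layers, four tokens, hidden size 2.
kv = tuple(
    ([[float(l * 10 + t)] * 2 for t in range(4)],
     [[-float(l * 10 + t)] * 2 for t in range(4)])
    for l in range(2)
)
blob = from_gpu_sketch(kv, 1, 3)
restored = tuple(
    ([[0.0] * 2 for _ in range(4)], [[0.0] * 2 for _ in range(4)])
    for _ in range(2)
)
to_gpu_sketch(blob, restored, 1, 3)
```

Only tokens 1 and 2 are written back; tokens outside [start, end) are untouched.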
- class VLLMPagedMemGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It produces and consumes memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
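The paged connector addresses each token through the 'slot_mapping' kwarg: with block shape [num_blocks, block_size, num_heads, head_size], token i of a sequence lives at slot block_table[i // block_size] * block_size + i % block_size. A minimal sketch of building that mapping (names are illustrative; vLLM constructs the real one internally):

```python
def build_slot_mapping(block_table, num_tokens, block_size):
    """Map each token index to its flat slot in the paged KV cache."""
    return [
        block_table[i // block_size] * block_size + (i % block_size)
        for i in range(num_tokens)
    ]

# A sequence of 6 tokens stored in physical blocks 3 and 7, block_size 4:
slots = build_slot_mapping([3, 7], num_tokens=6, block_size=4)
```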
- class VLLMPagedMemGPUConnectorV2(hidden_dim_size: int, num_layers: int, use_gpu: bool = False, **kwargs)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It produces and consumes memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Will set the memory_obj.metadata.fmt to MemoryFormat.KV_BLOB.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length equals that of the whole token sequence.
When prefix caching is in effect, slot_mapping will start with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the prefix-cached region (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length equals that of the whole token sequence.
When prefix caching is in effect, slot_mapping will start with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the prefix-cached region (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
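The "full slot mapping" notes above can be sketched as follows: -1s cover the cached prefix, and a valid start/end window lies entirely past it, so the kernel never sees -1 (helper name and values are illustrative):

```python
def full_slot_mapping(prefix_len, slots_for_new_tokens):
    """Full-sequence slot mapping: -1 over the cached prefix, real slots after."""
    return [-1] * prefix_len + list(slots_for_new_tokens)

# First 3 tokens hit the prefix cache; the rest occupy slots 12..15.
slot_mapping = full_slot_mapping(prefix_len=3, slots_for_new_tokens=[12, 13, 14, 15])

start, end = 3, 7           # must begin at or after the matched prefix
window = slot_mapping[start:end]
```

Because start is at least prefix_len, the slice handed to the kernel contains only real slots.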