LMCache GPU Connector Interface#
- class GPUConnectorInterface[source]#
- abstract from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Load data from a GPU buffer into the memory object. Subclasses should define the format of the kwargs.
- abstract get_shape(num_tokens: int) Size [source]#
Get the shape of the data given the number of tokens.
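The contract above can be sketched with a simplified, self-contained stand-in (plain Python, with a tuple in place of torch.Size and a hypothetical MemoryObjStub; the real classes live in lmcache and carry torch tensors):

```python
from abc import ABC, abstractmethod

class MemoryObjStub:
    """Hypothetical stand-in for lmcache's MemoryObj: holds a flat buffer."""
    def __init__(self, data):
        self.data = data  # a nested list standing in for a tensor

class GPUConnectorSketch(ABC):
    """Mirrors the GPUConnectorInterface contract described above."""

    @abstractmethod
    def from_gpu(self, memory_obj, start: int, end: int, **kwargs):
        """Copy tokens [start, end) from a GPU buffer into memory_obj.
        Subclasses define which kwargs (e.g. 'kvcaches') they require."""

    @abstractmethod
    def to_gpu(self, memory_obj, start: int, end: int, **kwargs):
        """Copy tokens [start, end) from memory_obj back to the GPU buffer."""

    @abstractmethod
    def get_shape(self, num_tokens: int) -> tuple:
        """Shape of the stored data for num_tokens tokens."""

class FlatConnector(GPUConnectorSketch):
    """Toy connector: stores one hidden_dim-sized vector per token."""
    def __init__(self, hidden_dim: int):
        self.hidden_dim = hidden_dim

    def from_gpu(self, memory_obj, start, end, **kwargs):
        buf = kwargs["gpu_buffer"]  # this subclass's required kwarg
        memory_obj.data = [row[:] for row in buf[start:end]]

    def to_gpu(self, memory_obj, start, end, **kwargs):
        buf = kwargs["gpu_buffer"]
        buf[start:end] = [row[:] for row in memory_obj.data]

    def get_shape(self, num_tokens: int) -> tuple:
        return (num_tokens, self.hidden_dim)
```

Note how the start/end token range and the subclass-specific kwarg mirror the signatures documented above.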
- class VLLMNestedTupleGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_tokens, …]
The token dimension is specified by token_dim when constructing the connector.
It produces and consumes memory objects in KV_BLOB format.
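The nested-tuple layout can be illustrated with plain Python lists standing in for tensors (sizes and names here are illustrative, not lmcache's actual objects):

```python
num_layers, num_tokens, hidden = 2, 4, 8

def zeros(shape):
    """Build a nested list of the given shape (toy tensor stand-in)."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]

# GPUTensor = Tuple[KVLayer, ...]; KVLayer = Tuple[K, V];
# each tensor's first dimension is the token dimension.
kvcaches = tuple(
    (zeros((num_tokens, hidden)), zeros((num_tokens, hidden)))
    for _ in range(num_layers)
)

k_layer0 = kvcaches[0][0]   # K tensor of layer 0
v_layer1 = kvcaches[1][1]   # V tensor of layer 1
```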
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
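A toy version of the from_gpu/to_gpu round trip over a token range can look like this (the gather/scatter order here is an assumption for illustration; the real KV_BLOB layout is defined by LMCache):

```python
def from_gpu_sketch(kvcaches, start, end):
    """Gather tokens [start, end) of every layer's K and V into one blob."""
    blob = []
    for k, v in kvcaches:  # one KVLayer = (K, V) per layer
        blob.append(([row[:] for row in k[start:end]],
                     [row[:] for row in v[start:end]]))
    return blob

def to_gpu_sketch(blob, kvcaches, start, end):
    """Scatter the blob back into tokens [start, end) of the caches."""
    for (k_chunk, v_chunk), (k, v) in zip(blob, kvcaches):
        k[start:end] = [row[:] for row in k_chunk]
        v[start:end] = [row[:] for row in v_chunk]

# Round-trip demo: two layers, four tokens, hidden size 2.
kv = tuple(
    ([[float(l * 10 + t)] * 2 for t in range(4)],
     [[-float(l * 10 + t)] * 2 for t in range(4)])
    for l in range(2)
)
blob = from_gpu_sketch(kv, 1, 3)
restored = tuple(
    ([[0.0] * 2 for _ in range(4)], [[0.0] * 2 for _ in range(4)])
    for _ in range(2)
)
to_gpu_sketch(blob, restored, 1, 3)
```

Only tokens 1 and 2 are written back; tokens outside [start, end) are untouched.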
- class VLLMPagedMemGPUConnector(hidden_dim_size: int, num_layers: int)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It produces and consumes memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs, or the memory object is not in KV_BLOB format.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
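The paged connector addresses each token through the 'slot_mapping' kwarg: with block shape [num_blocks, block_size, num_heads, head_size], token i of a sequence lives at slot block_table[i // block_size] * block_size + i % block_size. A minimal sketch of building that mapping (names are illustrative; vLLM constructs the real one internally):

```python
def build_slot_mapping(block_table, num_tokens, block_size):
    """Map each token index to its flat slot in the paged KV cache."""
    return [
        block_table[i // block_size] * block_size + (i % block_size)
        for i in range(num_tokens)
    ]

# A sequence of 6 tokens stored in physical blocks 3 and 7, block_size 4:
slots = build_slot_mapping([3, 7], num_tokens=6, block_size=4)
```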
- class VLLMPagedMemGPUConnectorV2(hidden_dim_size: int, num_layers: int, use_gpu: bool = False, **kwargs)[source]#
Bases:
GPUConnectorInterface
The GPU KV cache should be a nested tuple of K and V tensors. More specifically:
- GPUTensor = Tuple[KVLayer, …]
- KVLayer = Tuple[Tensor, Tensor]
- Tensor: [num_blocks, block_size, num_heads, head_size]
It produces and consumes memory objects in KV_BLOB format.
- from_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Will set the memory_obj.metadata.fmt to MemoryFormat.KV_BLOB.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length equals that of the whole token sequence.
When prefix caching is in effect, slot_mapping will start with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the prefix-cached region (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
- to_gpu(memory_obj: MemoryObj, start: int, end: int, **kwargs)[source]#
Expect a kwarg ‘kvcaches’ which is a nested tuple of K and V tensors. The kvcaches should correspond to the “WHOLE token sequence”.
Note
This function expects ‘slot_mapping’ to be a “full slot mapping” whose length equals that of the whole token sequence.
When prefix caching is in effect, slot_mapping will start with -1s up to the end of the matched prefix. The start and end should NEVER overlap with the prefix-cached region (which means the underlying CUDA kernel will never see -1 in slot_mapping).
- Raises:
ValueError – If ‘kvcaches’ is not provided in kwargs.
AssertionError – If the memory object does not have a tensor.
ValueError – If ‘slot_mapping’ is not provided in kwargs.
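The "full slot mapping" notes above can be sketched as follows: -1s cover the cached prefix, and a valid start/end window lies entirely past it, so the kernel never sees -1 (helper name and values are illustrative):

```python
def full_slot_mapping(prefix_len, slots_for_new_tokens):
    """Full-sequence slot mapping: -1 over the cached prefix, real slots after."""
    return [-1] * prefix_len + list(slots_for_new_tokens)

# First 3 tokens hit the prefix cache; the rest occupy slots 12..15.
slot_mapping = full_slot_mapping(prefix_len=3, slots_for_new_tokens=[12, 13, 14, 15])

start, end = 3, 7           # must begin at or after the matched prefix
window = slot_mapping[start:end]
```

Because start is at least prefix_len, the slice handed to the kernel contains only real slots.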