lmcache.integration.vllm package#

Submodules#

lmcache.integration.vllm.utils module#

lmcache.integration.vllm.utils.lmcache_get_config() → LMCacheEngineConfig | LMCacheEngineConfig[source]#

Get the LMCache configuration from the file pointed to by the environment variable LMCACHE_CONFIG_FILE. If the environment variable is not set, this function returns the default configuration.
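
A minimal usage sketch, assuming an LMCache installation; the YAML file path below is a placeholder you replace with your own configuration file:

import os

from lmcache.integration.vllm.utils import lmcache_get_config

# Point LMCache at a YAML configuration file (placeholder path).
# Without this variable, lmcache_get_config() falls back to the
# default configuration.
os.environ["LMCACHE_CONFIG_FILE"] = "/path/to/lmcache_config.yaml"

config = lmcache_get_config()
print(config)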

lmcache.integration.vllm.vllm_adapter module#

class lmcache.integration.vllm.vllm_adapter.RetrieveStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

CHUNK_PREFILL = 2#
NONE = 4#
PREFILL = 1#
class lmcache.integration.vllm.vllm_adapter.StoreStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

CHUNK_PREFILL = 2#
DECODE = 3#
NONE = 5#
PREFILL = 1#
SUFFIX_PREFILL = 4#
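
As an illustrative sketch (the summarize_statuses helper is hypothetical, not part of the adapter), a caller can branch on these per-request statuses as follows:

from lmcache.integration.vllm.vllm_adapter import RetrieveStatus, StoreStatus

def summarize_statuses(retrieve_status, store_status):
    # Hypothetical helper: count how many requests in the batch will
    # hit LMCache in each direction.
    num_retrieved = sum(s is not RetrieveStatus.NONE for s in retrieve_status)
    num_stored = sum(s is not StoreStatus.NONE for s in store_status)
    return num_retrieved, num_stored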
lmcache.integration.vllm.vllm_adapter.broadcast_seq_group_list(model_input: ModelInputForGPUWithSamplingMetadata, is_driver_worker: bool) → ModelInputForGPUWithSamplingMetadata[source]#

Broadcast the model_input from the driver worker to non-driver workers.

Parameters:
  • model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.

  • is_driver_worker (bool) – Whether the code is executed in the driver worker.

Returns:

The original model_input if executed in the driver worker; the broadcasted model_input otherwise.

lmcache.integration.vllm.vllm_adapter.build_partial_prefill_input(model_input: ModelInputForGPUWithSamplingMetadata, full_tokens_list: List[Tensor], num_computed_tokens_list: List[int], start_pos_list: List[int], slot_mapping_flat: Tensor, lmc_num_computed_tokens_list: List[int], is_prefill_list: List[bool], do_sample_list: List[bool], device: device, cache_config: CacheConfig) → ModelInputForGPUWithSamplingMetadata[source]#

Helper function to rebuild the model input for the current request.

lmcache.integration.vllm.vllm_adapter.close_lmcache_engine() → None[source]#

Close the LMCache engine if it is initialized.

lmcache.integration.vllm.vllm_adapter.init_lmcache_engine(model_config: ModelConfig, parallel_config: ParallelConfig, cache_config: CacheConfig) → LMCacheEngine | None[source]#

Initialize the LMCache engine from the given model, parallel, and KV cache configurations. This function checks the environment variable LMCACHE_CONFIG_FILE to load the configuration file. If that environment variable is not set, this function returns None.

Parameters:
  • model_config (ModelConfig) – The model configuration in vLLM.

  • parallel_config (ParallelConfig) – The parallel configuration in vLLM.

  • cache_config (CacheConfig) – The KV cache configuration in vLLM.

Returns:

The initialized LMCache engine or None (if the environment variable LMCACHE_CONFIG_FILE is not set).

Return type:

Optional[LMCacheEngine]
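
A hedged sketch of the initialization flow; the setup_lmcache helper is hypothetical, and the ModelConfig, ParallelConfig, and CacheConfig instances are assumed to come from an already constructed vLLM engine:

import os

from lmcache.integration.vllm.vllm_adapter import (
    close_lmcache_engine,
    init_lmcache_engine,
)

def setup_lmcache(model_config, parallel_config, cache_config):
    # Hypothetical wrapper: initialize the LMCache engine only when a
    # configuration file is provided via LMCACHE_CONFIG_FILE.
    if "LMCACHE_CONFIG_FILE" not in os.environ:
        # init_lmcache_engine() would return None in this case anyway.
        return None
    return init_lmcache_engine(model_config, parallel_config, cache_config)

# On shutdown, release the engine if it was initialized:
# close_lmcache_engine()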

lmcache.integration.vllm.vllm_adapter.lmcache_retrieve_kv(model_executable: Module, model_input: ModelInputForGPUWithSamplingMetadata, cache_config: CacheConfig, kv_caches: List[Tensor], retrieve_status: List[RetrieveStatus]) → Tuple[ModelInputForGPUWithSamplingMetadata, bool, Tensor | IntermediateTensors][source]#

Retrieve the KV caches from LMCache for the current model_input, and rebuild the model_input to reflect the changes in KV if necessary.

Parameters:
  • model_executable (torch.nn.Module) – The model executable for the current request.

  • model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.

  • kv_caches (List[torch.Tensor]) – The paged memory to put the retrieved KV into.

  • retrieve_status (List[RetrieveStatus]) – Indicates whether and how the KV cache of each request is retrieved.

Returns:

The rebuilt model_input to reflect the changes in KV, together with a boolean indicating whether the entire execute_model call should be skipped.

lmcache.integration.vllm.vllm_adapter.lmcache_should_retrieve(model_input: ModelInputForGPUWithSamplingMetadata) → List[RetrieveStatus][source]#

Check whether we should retrieve KV from LMCache for the current model_input.

Parameters:
  • model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.

Returns:

A list of RetrieveStatus, one for each request, indicating whether and how its KV cache should be retrieved.
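
A hedged sketch of how a model runner might combine lmcache_should_retrieve and lmcache_retrieve_kv before running the forward pass; the maybe_retrieve_kv wrapper and its arguments are illustrative assumptions, not part of the adapter:

from lmcache.integration.vllm.vllm_adapter import (
    RetrieveStatus,
    lmcache_retrieve_kv,
    lmcache_should_retrieve,
)

def maybe_retrieve_kv(model, model_input, cache_config, kv_caches):
    # Hypothetical wrapper: pull cached KV from LMCache and report
    # whether the forward pass can be skipped entirely.
    retrieve_status = lmcache_should_retrieve(model_input)
    if all(s is RetrieveStatus.NONE for s in retrieve_status):
        # Nothing to retrieve; run the model as usual.
        return model_input, False, None

    model_input, skip_model_exec, hidden_or_intermediate = lmcache_retrieve_kv(
        model, model_input, cache_config, kv_caches, retrieve_status
    )
    return model_input, skip_model_exec, hidden_or_intermediate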

lmcache.integration.vllm.vllm_adapter.lmcache_should_store(model_input: ModelInputForGPUWithSamplingMetadata) → List[StoreStatus][source]#

Check whether we should store KV into LMCache for the current model_input.

Parameters:
  • model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.

Returns:

A list of StoreStatus, one for each request: StoreStatus.PREFILL, StoreStatus.DECODE, or StoreStatus.CHUNK_PREFILL if KV should be stored after the corresponding prefill or decode step, and StoreStatus.NONE if no storing is required.

lmcache.integration.vllm.vllm_adapter.lmcache_store_kv(model_config: ModelConfig, parallel_config: ParallelConfig, cache_config: CacheConfig, model_executable: Module, model_input: ModelInputForGPUWithSamplingMetadata, kv_caches: List[Tensor], store_status: List[StoreStatus]) → None[source]#

Store the KV caches into LMCache for the current model_input.

Parameters:
  • model_executable (torch.nn.Module) – The model executable for the current request.

  • model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.

  • kv_caches (List[torch.Tensor]) – The paged memory to get KV from

  • store_status (List[StoreStatus]) – Indicates whether and how the KV cache of each request is stored.
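
A corresponding sketch for the store path after the forward pass; the maybe_store_kv wrapper is hypothetical, and the config objects are assumed to come from the surrounding vLLM runner:

from lmcache.integration.vllm.vllm_adapter import (
    StoreStatus,
    lmcache_should_store,
    lmcache_store_kv,
)

def maybe_store_kv(model_config, parallel_config, cache_config,
                   model, model_input, kv_caches):
    # Hypothetical wrapper: push freshly computed KV into LMCache
    # only for the requests the adapter marks as storable.
    store_status = lmcache_should_store(model_input)
    if all(s is StoreStatus.NONE for s in store_status):
        return

    lmcache_store_kv(
        model_config,
        parallel_config,
        cache_config,
        model,
        model_input,
        kv_caches,
        store_status,
    )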

Module contents#