lmcache.integration.vllm package#
Submodules#
lmcache.integration.vllm.utils module#
- lmcache.integration.vllm.utils.lmcache_get_config() → LMCacheEngineConfig | LMCacheEngineConfig [source]#
Get the LMCache configuration from the file specified by the environment variable LMCACHE_CONFIG_FILE. If the environment variable is not set, this function returns the default configuration.
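A minimal usage sketch (the configuration file path below is illustrative, not part of the module):

```python
import os

from lmcache.integration.vllm.utils import lmcache_get_config

# Point LMCache at a configuration file; if the variable is unset,
# lmcache_get_config() falls back to the default configuration.
os.environ["LMCACHE_CONFIG_FILE"] = "/path/to/lmcache_config.yaml"

config = lmcache_get_config()
print(config)  # an LMCacheEngineConfig instance
```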
lmcache.integration.vllm.vllm_adapter module#
- class lmcache.integration.vllm.vllm_adapter.RetrieveStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases: Enum
- CHUNK_PREFILL = 2#
- NONE = 4#
- PREFILL = 1#
- class lmcache.integration.vllm.vllm_adapter.StoreStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases: Enum
- CHUNK_PREFILL = 2#
- DECODE = 3#
- NONE = 5#
- PREFILL = 1#
- SUFFIX_PREFILL = 4#
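A small sketch of how these status values might drive control flow in an integration layer; the helper functions are illustrative, only the enum members come from this module:

```python
from lmcache.integration.vllm.vllm_adapter import RetrieveStatus, StoreStatus

def needs_retrieval(status: RetrieveStatus) -> bool:
    # RetrieveStatus.NONE means there is nothing to inject for this request.
    return status != RetrieveStatus.NONE

def store_phase(status: StoreStatus) -> str:
    # Map each store decision to a short label (labels simply restate the member names).
    return {
        StoreStatus.PREFILL: "prefill",
        StoreStatus.CHUNK_PREFILL: "chunked prefill",
        StoreStatus.SUFFIX_PREFILL: "suffix prefill",
        StoreStatus.DECODE: "decode",
        StoreStatus.NONE: "no store",
    }[status]
```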
- lmcache.integration.vllm.vllm_adapter.broadcast_seq_group_list(model_input: ModelInputForGPUWithSamplingMetadata, is_driver_worker: bool) → ModelInputForGPUWithSamplingMetadata [source]#
Broadcast the model_input from the driver worker to non-driver workers.
- Parameters:
model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.
is_driver_worker (bool) – Whether the code is executed in the driver worker.
- Returns:
The original model_input if executed in the driver worker; the broadcasted model_input otherwise.
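A sketch of where this call might sit in a worker's step, assuming a distributed vLLM run in which rank 0 is the driver (the rank check and helper name are illustrative):

```python
import torch.distributed as dist

from lmcache.integration.vllm.vllm_adapter import broadcast_seq_group_list

def sync_model_input(model_input):
    # The driver keeps its own model_input; non-driver workers receive the
    # broadcast copy so that every rank sees the same sequence groups.
    is_driver = (not dist.is_initialized()) or dist.get_rank() == 0
    return broadcast_seq_group_list(model_input, is_driver_worker=is_driver)
```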
- lmcache.integration.vllm.vllm_adapter.build_partial_prefill_input(model_input: ModelInputForGPUWithSamplingMetadata, full_tokens_list: List[Tensor], num_computed_tokens_list: List[int], start_pos_list: List[int], slot_mapping_flat: Tensor, lmc_num_computed_tokens_list: List[int], is_prefill_list: List[bool], do_sample_list: List[bool], device: device, cache_config: CacheConfig) → ModelInputForGPUWithSamplingMetadata [source]#
Helper function to rebuild the model input for the current request.
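A hedged sketch of the call shape only, using a hypothetical list of per-request records named `requests`; it shows how the parallel per-request lists line up rather than a standalone runnable invocation, since a real ModelInputForGPUWithSamplingMetadata comes from vLLM's model runner:

```python
from lmcache.integration.vllm.vllm_adapter import build_partial_prefill_input

def rebuild_after_retrieval(model_input, requests, slot_mapping_flat, device, cache_config):
    # `requests` is a hypothetical list of per-request records; every list
    # below must have exactly one entry per request, in the same order.
    return build_partial_prefill_input(
        model_input,
        full_tokens_list=[r.full_tokens for r in requests],
        num_computed_tokens_list=[r.num_computed for r in requests],
        start_pos_list=[r.start_pos for r in requests],
        slot_mapping_flat=slot_mapping_flat,
        lmc_num_computed_tokens_list=[r.lmc_num_computed for r in requests],
        is_prefill_list=[r.is_prefill for r in requests],
        do_sample_list=[r.do_sample for r in requests],
        device=device,
        cache_config=cache_config,
    )
```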
- lmcache.integration.vllm.vllm_adapter.close_lmcache_engine() → None [source]#
Close the LMCache engine if it is initialized.
- lmcache.integration.vllm.vllm_adapter.init_lmcache_engine(model_config: ModelConfig, parallel_config: ParallelConfig, cache_config: CacheConfig) → LMCacheEngine | None [source]#
Initialize the LMCache engine with the given model, parallel, and KV cache configurations. This function checks the environment variable LMCACHE_CONFIG_FILE to load the configuration file. If that environment variable is not set, this function returns None.
- Parameters:
model_config (ModelConfig) – The model configuration in vLLM.
parallel_config (ParallelConfig) – The parallel configuration in vLLM.
cache_config (CacheConfig) – The KV cache configuration in vLLM.
- Returns:
The initialized LMCache engine or None (if the environment variable LMCACHE_CONFIG_FILE is not set).
- Return type:
Optional[LMCacheEngine]
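A lifecycle sketch covering init_lmcache_engine together with close_lmcache_engine documented above; the vLLM ModelConfig, ParallelConfig, and CacheConfig objects are assumed to be available from the surrounding engine code, and the helper names are illustrative:

```python
from lmcache.integration.vllm.vllm_adapter import (
    close_lmcache_engine,
    init_lmcache_engine,
)

def setup_lmcache(model_config, parallel_config, cache_config):
    # init_lmcache_engine returns None when LMCACHE_CONFIG_FILE is not set,
    # in which case the caller can simply run without the LMCache integration.
    engine = init_lmcache_engine(model_config, parallel_config, cache_config)
    if engine is None:
        print("LMCACHE_CONFIG_FILE not set; running without LMCache")
    return engine

def teardown_lmcache():
    # Per the docstring above, this is a no-op if the engine was never initialized.
    close_lmcache_engine()
```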
- lmcache.integration.vllm.vllm_adapter.lmcache_retrieve_kv(model_executable: Module, model_input: ModelInputForGPUWithSamplingMetadata, cache_config: CacheConfig, kv_caches: List[Tensor], retrieve_status: List[RetrieveStatus]) → Tuple[ModelInputForGPUWithSamplingMetadata, bool, Tensor | IntermediateTensors] [source]#
Retrieve the KV caches from LMCache for the current model_input, and rebuild the model_input to reflect the changes in KV if necessary.
- Parameters:
model_executable (torch.nn.Module) – The model executable for the current request.
model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.
cache_config (CacheConfig) – The KV cache configuration in vLLM.
kv_caches (List[torch.Tensor]) – The paged memory to write the retrieved KV into.
retrieve_status (List[RetrieveStatus]) – Indicates whether and how the KV cache of each request is retrieved.
- Returns:
The rebuilt model_input reflecting the changes in KV, and a boolean indicating whether the entire execute_model call should be skipped.
- lmcache.integration.vllm.vllm_adapter.lmcache_should_retrieve(model_input: ModelInputForGPUWithSamplingMetadata) → List[RetrieveStatus] [source]#
Check whether we should retrieve KV from LMCache for the current model_input.
- Parameters:
model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.
- Returns:
A list of RetrieveStatus, one per request, indicating whether and how KV should be retrieved.
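A sketch of the retrieval path as it might look inside a model runner, combining lmcache_should_retrieve with lmcache_retrieve_kv documented above; the model_executable, model_input, cache_config, and kv_caches objects are assumed to come from vLLM's execution loop, and the helper name is illustrative:

```python
from lmcache.integration.vllm.vllm_adapter import (
    RetrieveStatus,
    lmcache_retrieve_kv,
    lmcache_should_retrieve,
)

def maybe_retrieve(model_executable, model_input, cache_config, kv_caches):
    retrieve_status = lmcache_should_retrieve(model_input)
    if all(s == RetrieveStatus.NONE for s in retrieve_status):
        # Nothing cached for this batch; run the model as usual.
        return model_input, False, None
    # Inject cached KV into the paged memory and rebuild model_input;
    # the returned boolean says whether execute_model can be skipped entirely.
    return lmcache_retrieve_kv(
        model_executable, model_input, cache_config, kv_caches, retrieve_status
    )
```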
- lmcache.integration.vllm.vllm_adapter.lmcache_should_store(model_input: ModelInputForGPUWithSamplingMetadata) → List[StoreStatus] [source]#
Check whether we should store KV into LMCache for the current model_input.
- Parameters:
model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.
- Returns:
A list of StoreStatus, one per request: StoreStatus.PREFILL/CHUNK_PREFILL/DECODE if KV should be stored after the corresponding phase, and StoreStatus.NONE if no storing is required.
- lmcache.integration.vllm.vllm_adapter.lmcache_store_kv(model_config: ModelConfig, parallel_config: ParallelConfig, cache_config: CacheConfig, model_executable: Module, model_input: ModelInputForGPUWithSamplingMetadata, kv_caches: List[Tensor], store_status: List[StoreStatus]) → None [source]#
Store the KV caches into LMCache for the current model_input.
- Parameters:
model_config (ModelConfig) – The model configuration in vLLM.
parallel_config (ParallelConfig) – The parallel configuration in vLLM.
cache_config (CacheConfig) – The KV cache configuration in vLLM.
model_executable (torch.nn.Module) – The model executable for the current request.
model_input (ModelInputForGPUWithSamplingMetadata) – The model input for the current request.
kv_caches (List[torch.Tensor]) – The paged memory to read KV from.
store_status (List[StoreStatus]) – Indicates whether and how the KV cache of each request is stored.
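A matching sketch for the store path, combining lmcache_should_store with lmcache_store_kv documented above; as before, the config objects, model_input, and kv_caches are assumed to come from vLLM's execution loop, and the helper name is illustrative:

```python
from lmcache.integration.vllm.vllm_adapter import (
    StoreStatus,
    lmcache_should_store,
    lmcache_store_kv,
)

def maybe_store(model_config, parallel_config, cache_config,
                model_executable, model_input, kv_caches):
    store_status = lmcache_should_store(model_input)
    if all(s == StoreStatus.NONE for s in store_status):
        return  # nothing to store for this batch
    lmcache_store_kv(
        model_config, parallel_config, cache_config,
        model_executable, model_input, kv_caches, store_status,
    )
```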