@_lmcache_nvtx_annotate
def decode_chunk(
    cdf: torch.Tensor,
    data_chunk: CacheGenGPUBytestream,
    target_buffer: torch.Tensor,
) -> None:
    """Write the decode output into target_buffer.

    Expected shape: [nlayers (K and V in total), ntokens, nchannels]
    """
    bytes_tensor = data_chunk.bytestream
    length_prefsum = (
        data_chunk.bytestream_lengths.flatten()
        .cumsum(0)
        .reshape(data_chunk.bytestream_lengths.shape)
    )
    torchac_cuda.decode_fast_prefsum(cdf, bytes_tensor, length_prefsum, target_buffer)
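The `length_prefsum` computation above can be sketched in pure Python (an illustrative, CPU-only version; the real code runs `torch.cumsum` on GPU tensors). Each entry becomes the running total of all bytestream lengths so far, which lets the decoder kernel locate every token's compressed bytes by offset:

```python
# Pure-Python sketch of decode_chunk's prefix-sum step. The toy shapes
# and values are made up for illustration.
from itertools import accumulate

def length_prefix_sums(bytestream_lengths):
    """Flatten, take the cumulative sum, then restore the 2D shape."""
    flat = [n for row in bytestream_lengths for n in row]
    prefsum = list(accumulate(flat))
    ncols = len(bytestream_lengths[0])
    return [prefsum[i:i + ncols] for i in range(0, len(prefsum), ncols)]

lengths = [[3, 5, 2], [4, 1, 6]]  # bytes per (layer, token) slot
print(length_prefix_sums(lengths))  # [[3, 8, 10], [14, 15, 21]]
```

Note the cumulative sum runs over the *flattened* lengths, so offsets carry across rows; reshaping afterwards only restores the indexing layout.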
@_lmcache_nvtx_annotate
def decode_function_gpu(
    cdf: torch.Tensor,
    data_chunks: List[CacheGenGPUBytestream],
    layers_in_key: int,
    chunk_size: int,
    output: torch.Tensor,
):
    # TODO: dtype and shape -- still have 128 and 8
    """Decode the KV cache from the encoded KV bytestream.

    Inputs:
        cdf: the CDF tensor, in shape [2 * nlayers, nchannels, bins + 1]
        data_chunks: the data chunks from the encoder's output
        layers_in_key: number of layers in K (or V; K and V have the same
            number of layers)
        chunk_size: the chunk size
        output: output buffer, in shape [ntokens, 2 * nlayers * nchannels]

    Outputs:
        key: the decoded key tensor, in shape (layers, tokens, nchannels)
        value: the decoded value tensor, in shape (layers, tokens, nchannels)
    """
    nlayers, nchannels, _ = cdf.shape
    output = output.reshape((nlayers, chunk_size, nchannels))
    start = 0
    for data_chunk in data_chunks:
        end = start + data_chunk.ntokens
        decode_chunk(cdf, data_chunk, output[:, start:end, :])
        start = end
    out = output.reshape((2, layers_in_key, chunk_size, nchannels))
    key, value = out.float()
    return key, value
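The final reshape-and-unpack step above can be illustrated without torch (a sketch with toy sizes and plain lists; the real code reshapes a GPU tensor to `(2, layers_in_key, chunk_size, nchannels)` and unpacks along dim 0):

```python
# Sketch of how decode_function_gpu splits the flat decode buffer into
# the K half and the V half: the first layers_in_key per-layer blocks
# belong to keys, the remainder to values. Toy data for clarity.
def split_kv(layer_blocks, layers_in_key):
    """layer_blocks: list of 2 * layers_in_key per-layer blocks."""
    assert len(layer_blocks) == 2 * layers_in_key
    key = layer_blocks[:layers_in_key]
    value = layer_blocks[layers_in_key:]
    return key, value

blocks = ["K0", "K1", "V0", "V1"]  # 2 key layers followed by 2 value layers
key, value = split_kv(blocks, layers_in_key=2)
print(key, value)  # ['K0', 'K1'] ['V0', 'V1']
```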
@_lmcache_nvtx_annotate
def from_bytes(self, bs: bytes) -> torch.Tensor:
    encoder_output = CacheGenGPUEncoderOutput.from_bytes(bs)
    encoder_output.max_tensors_key = encoder_output.max_tensors_key.cuda()
    encoder_output.max_tensors_value = encoder_output.max_tensors_value.cuda()
    ntokens = encoder_output.max_tensors_key.shape[1]
    layers_in_key = encoder_output.max_tensors_key.shape[0]
    key, value = decode_function_gpu(
        encoder_output.cdf,
        encoder_output.data_chunks,
        layers_in_key,
        ntokens,
        self.get_output_buffer(
            encoder_output.cdf.shape[0] // 2,
            encoder_output.cdf.shape[1],
            ntokens,
        ),
    )

    # Temporary fix for #83: move key_bins and value_bins to the device
    # of key and value.
    # This needs a long-term fix. Currently, CacheGenGPUEncoderOutput
    # carries an implicit device: if the encoder encodes the tensor on
    # GPU0, from_bytes will also return a tensor on GPU0. We may want to
    # configure the device dynamically from config and metadata in the
    # future.
    if self.key_bins.device != key.device:
        self.key_bins = self.key_bins.to(key.device)
    if self.value_bins.device != value.device:
        self.value_bins = self.value_bins.to(value.device)

    key = do_dequantize(key, self.key_bins, encoder_output.max_tensors_key)
    value = do_dequantize(value, self.value_bins, encoder_output.max_tensors_value)

    """merge key and value back and reshape"""
    nlayers, ntokens, nchannels = key.shape
    blob = torch.stack([key, value])  # [2, nlayers, ntokens, nchannels]
    blob = blob.reshape(
        (
            2,
            nlayers,
            ntokens,
            encoder_output.num_heads,
            encoder_output.head_size,
        )
    )
    match self.fmt:
        case "vllm":
            # [nlayers, 2, ntokens, num_heads, head_size]
            return blob.permute((1, 0, 2, 3, 4)).to(self.dtype)
        case "huggingface":
            # [nlayers, 2, num_heads, ntokens, head_size]
            return blob.permute((1, 0, 3, 2, 4)).to(self.dtype)
        case _:
            raise RuntimeError("Unknown format %s" % self.fmt)
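The format-dependent `permute` at the end of `from_bytes` only reorders axes. The effect on the shape can be sketched without torch (toy sizes; the axis permutations `(1, 0, 2, 3, 4)` and `(1, 0, 3, 2, 4)` are the ones from the code above):

```python
# Sketch of the layout permutation in from_bytes. The stacked blob has
# shape [2, nlayers, ntokens, num_heads, head_size]; "vllm" swaps the
# first two axes, and "huggingface" additionally swaps tokens and heads.
def permuted_shape(shape, perm):
    """Return the shape that results from applying an axis permutation."""
    return tuple(shape[axis] for axis in perm)

blob_shape = (2, 32, 256, 8, 128)  # toy [2, nlayers, ntokens, heads, head_size]
print(permuted_shape(blob_shape, (1, 0, 2, 3, 4)))  # vllm layout
print(permuted_shape(blob_shape, (1, 0, 3, 2, 4)))  # huggingface layout
```

This matches the layouts the two serving stacks expect: vLLM keeps tokens before heads, while HuggingFace's KV cache stores heads before tokens.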