Update modeling_nvembed.py

#49
by lukelv - opened

Update _do_encode() so that it can return either a torch.FloatTensor or a NumPy array.

Hello Author,
I was glad to work with your model when training and encoding a list of sentences. I used _do_encode() for this and noticed that it converts the vectors to a NumPy array at every step of the for loop. That works, but I needed the vectors returned as a Tensor, so I modified the function like this:

    

    @torch.no_grad()
    def _do_encode(self,
        prompts: List[str],
        batch_size: int=1,
        instruction: str="",
        max_length: int=4096,
        num_workers: int=32,
        **kwargs
    ) -> Union[np.ndarray, torch.FloatTensor]:
        dataset: Dataset = Dataset.from_dict({'input_texts': prompts})
        dataset.set_transform(partial(input_transform_func,
                                      self.tokenizer,
                                      always_add_eos=True,
                                      max_length=max_length,
                                      instruction=instruction))

        data_collator = DataCollatorWithPadding(self.tokenizer)
        data_loader = DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,
            drop_last=False,
            num_workers=num_workers,
            collate_fn=data_collator,
            pin_memory=True)

        if self.padding_side == "right" and self.is_mask_instruction and len(instruction) > 0:
            instruction_lens = len(self.tokenizer.tokenize(instruction))
        else:
            instruction_lens = 0

        encoded_embeds = []
        device = next(self.embedding_model.parameters()).device
        for batch_dict in tqdm(data_loader, desc='encoding', mininterval=10):
            features = self.prepare_kwargs_from_batch(batch_dict, instruction_lens, device=device)
            embeds = self(**features)["sentence_embeddings"].squeeze(1)
            encoded_embeds.append(embeds)
        # Concatenate once after the loop instead of converting every batch.
        encoded_embeds = torch.cat(encoded_embeds, dim=0)
        # Tensor output is the default; pass return_numpy=True for a NumPy array.
        if kwargs.get("return_numpy", False):
            encoded_embeds = encoded_embeds.detach().cpu().numpy()
        return encoded_embeds

The function can now return either type. In addition, I found that it encodes faster than the previous version, because it converts the tensor to a NumPy array only once, after the for loop finishes. Please consider this change.
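To illustrate the idea outside the model, here is a minimal sketch of the "concatenate once, convert once" pattern. The helper name `collect_embeddings` and the random stand-in batches are hypothetical and not part of modeling_nvembed.py; only `torch.cat`, `Tensor.detach`, `Tensor.cpu`, and `Tensor.numpy` are real PyTorch calls.

```python
from typing import Iterable, Union

import numpy as np
import torch


def collect_embeddings(
    batches: Iterable[torch.Tensor],
    return_numpy: bool = False,
) -> Union[np.ndarray, torch.Tensor]:
    """Hypothetical helper: merge per-batch embeddings, optionally as NumPy."""
    # Keep every batch as a tensor inside the loop; no per-batch .numpy() call.
    chunks = [batch.detach() for batch in batches]
    merged = torch.cat(chunks, dim=0)
    if return_numpy:
        # A single device-to-host copy and conversion after the loop.
        return merged.cpu().numpy()
    return merged


# Stand-in for model outputs: three batches of 2 vectors with 4 dimensions each.
fake_batches = [torch.randn(2, 4) for _ in range(3)]

as_tensor = collect_embeddings(fake_batches)
as_numpy = collect_embeddings(fake_batches, return_numpy=True)
```

With the patched method, the equivalent call would presumably pass `return_numpy=True` through `**kwargs`, while callers who omit it get a tensor on the model's device.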

nada5 changed pull request status to merged
