Update README.md
README.md CHANGED
@@ -5287,7 +5287,7 @@ license: mit
 
 ## E5-mistral-7b-instruct
 
-**[TODO] Technical
+**[TODO] Technical report on the model training and evaluation will be available before 2024-01-01.**
 
 Some highlights for preview:
 * This model is only fine-tuned for less than 1000 steps, no contrastive pre-training is used.
@@ -5305,19 +5305,17 @@ import torch.nn.functional as F
 
 from torch import Tensor
 from transformers import AutoTokenizer, AutoModel
-from transformers.file_utils import PaddingStrategy
 
 
 def last_token_pool(last_hidden_states: Tensor,
                     attention_mask: Tensor) -> Tensor:
-    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
     if left_padding:
-        return last_hidden[:, -1]
+        return last_hidden_states[:, -1]
     else:
         sequence_lengths = attention_mask.sum(dim=1) - 1
-        batch_size = last_hidden.shape[0]
-        return last_hidden[torch.arange(batch_size, device=last_hidden.device), sequence_lengths]
+        batch_size = last_hidden_states.shape[0]
+        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
 
 
 def get_detailed_instruct(task_description: str, query: str) -> str:
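For readers skimming the diff, here is a small self-contained sketch of the revised `last_token_pool`: it restates the function exactly as added above and runs it on a toy right-padded batch. The dummy tensors and the printed check are illustrative additions, not part of the model card.

```python
# Minimal sketch: the revised last_token_pool applied to a toy batch.
import torch
from torch import Tensor


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # If every sequence has a real (non-padding) token in the last column,
    # the batch is left-padded, so the final position already holds the last token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # Right padding: index each row at its last non-padding position.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Toy right-padded batch: 2 sequences, 4 positions, hidden size 3.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],   # real length 3 -> pick position 2
                     [1, 1, 0, 0]])  # real length 2 -> pick position 1
print(last_token_pool(hidden, mask))  # rows equal hidden[0, 2] and hidden[1, 1]
```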
@@ -5336,7 +5334,7 @@ model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
 
 max_length = 4096
 # Tokenize the input texts
-batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=
+batch_dict = tokenizer(input_texts, max_length=max_length - 1, return_attention_mask=False, padding=False, truncation=True)
 # append eos_token_id to every input_ids
 batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]
 batch_dict = tokenizer.pad(batch_dict, padding=True, return_attention_mask=True, return_tensors='pt')
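The single-line change above is part of a three-step flow that is easier to read end to end: tokenize without padding while reserving one slot for EOS, append `eos_token_id` manually, then pad the whole batch. The sketch below assumes the tokenizer named in the model card and two placeholder input strings.

```python
# Sketch of the tokenization flow after this change (placeholder inputs).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-mistral-7b-instruct')
input_texts = ['how much protein should a female eat', 'summit define']
max_length = 4096

# Step 1: tokenize with truncation to max_length - 1 and no padding,
# leaving room for the EOS token the embedding model expects at the end.
batch_dict = tokenizer(input_texts, max_length=max_length - 1,
                       return_attention_mask=False, padding=False, truncation=True)

# Step 2: append eos_token_id to every sequence.
batch_dict['input_ids'] = [ids + [tokenizer.eos_token_id] for ids in batch_dict['input_ids']]

# Step 3: pad the batch once and build attention masks and tensors.
batch_dict = tokenizer.pad(batch_dict, padding=True,
                           return_attention_mask=True, return_tensors='pt')
print(batch_dict['input_ids'].shape, batch_dict['attention_mask'].shape)
```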
@@ -5371,6 +5369,8 @@ Yes, this is how the model is trained, otherwise you will see a performance degradation.
 The task definition should be a one-sentence instruction that describes the task.
 This is a way to customize text embeddings for different scenarios through natural language instructions.
 
+Please check out [unilm/e5/utils.py](https://github.com/microsoft/unilm/blob/16da2f193b9c1dab0a692c6e4380bd43e70a40cd/e5/utils.py#L93) for instructions we used for evaluation.
+
 On the other hand, there is no need to add instructions to the document side.
 
 **2. Why are my reproduced results slightly different from reported in the model card?**
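To make the query-side versus document-side asymmetry concrete, here is a brief sketch built around the model card's `get_detailed_instruct` helper. The helper body is restated from the model card rather than shown in this diff, and the task description and texts are illustrative examples, not the evaluation setup.

```python
# Sketch: instructions go on the query side only; documents are embedded as-is.
def get_detailed_instruct(task_description: str, query: str) -> str:
    # Instruct/Query template restated from the model card.
    return f'Instruct: {task_description}\nQuery: {query}'


# One-sentence task definition customizing the embedding for retrieval.
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'summit define'),
]

# No instruction prefix on the document side.
documents = [
    'Protein requirements vary with age, body weight, and activity level.',
    'A summit is the highest point of a mountain.',
]

input_texts = queries + documents
print(input_texts[0])
```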