Multilingual-E5-base (sentence-transformers)
This is the sentence-transformers version of the intfloat/multilingual-e5-base model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
Usage (Sentence-Transformers)
Using this model becomes easy when you have sentence-transformers installed:
pip install -U sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = ['query: how much protein should a female eat',
             'query: 南瓜的家常做法',
             "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
             "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]
model = SentenceTransformer('embaas/sentence-transformers-multilingual-e5-base')
embeddings = model.encode(sentences)
print(embeddings)
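The prefix convention described in the comments above is easy to forget when inputs come from different sources. The following is a minimal sketch of a helper that applies it; `add_prefix` and `VALID_ROLES` are hypothetical names introduced here for illustration and are not part of sentence-transformers.

```python
from typing import Iterable, List

# Hypothetical helper, not part of sentence-transformers.
VALID_ROLES = ("query", "passage")
_PREFIXES = tuple(f"{role}: " for role in VALID_ROLES)


def add_prefix(texts: Iterable[str], role: str = "query") -> List[str]:
    """Prepend 'query: ' or 'passage: ' to each text that has no prefix yet."""
    if role not in VALID_ROLES:
        raise ValueError(f"role must be one of {VALID_ROLES}, got {role!r}")
    prefixed = []
    for text in texts:
        if text.startswith(_PREFIXES):
            prefixed.append(text)  # already prefixed, leave untouched
        else:
            prefixed.append(f"{role}: {text}")
    return prefixed


queries = add_prefix(["how much protein should a female eat"], role="query")
passages = add_prefix(["Protein needs vary with age and activity."], role="passage")
print(queries + passages)
```

The resulting lists can be passed to `model.encode` in place of hand-prefixed strings.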
Usage (Hugging Face)
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: 南瓜的家常做法',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
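Because the embeddings are L2-normalized, each entry of `scores` is a cosine similarity scaled by 100, with one row per query and one column per passage. The sketch below shows one way such a matrix might be turned into a ranking. `best_passage_per_query` is a hypothetical helper, and the numbers are made up for illustration rather than taken from the model.

```python
from typing import List


def best_passage_per_query(scores: List[List[float]]) -> List[int]:
    """For each query row, return the column index of the highest-scoring passage."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Made-up values for illustration only (not actual model output).
# Rows are queries and columns are passages, the same layout as scores.tolist().
example_scores = [[90.0, 70.0],
                  [72.0, 88.0]]
print(best_passage_per_query(example_scores))  # prints [0, 1]
```

The scaling factor of 100 does not affect the ranking, since it multiplies every entry equally.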
Using with API
You can also use the embaas API to encode your input. Get a free API key from embaas.io.
import requests
url = "https://api.embaas.io/v1/embeddings/"
headers = {
    "Content-Type": "application/json",
    # Replace ${YOUR_API_KEY} with your own key.
    "Authorization": "Bearer ${YOUR_API_KEY}"
}
data = {
    "texts": ["This is an example sentence.", "Here is another sentence."],
    "instruction": "query",
    "model": "multilingual-e5-base"
}
response = requests.post(url, json=data, headers=headers)
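To avoid hard-coding the key in source files, the headers and payload could be assembled from an environment variable. This is only a sketch: `build_embedding_request` and the variable name `EMBAAS_API_KEY` are assumptions made for illustration, and the field names are copied from the request above.

```python
import json
import os
from typing import Dict, List, Tuple


def build_embedding_request(
    texts: List[str],
    instruction: str = "query",
    model: str = "multilingual-e5-base",
) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Return (headers, data) for the embeddings endpoint shown above."""
    # EMBAAS_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = os.environ.get("EMBAAS_API_KEY", "")
    if not api_key:
        raise RuntimeError("Set the EMBAAS_API_KEY environment variable first")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    data = {"texts": texts, "instruction": instruction, "model": model}
    json.dumps(data)  # fails early if the payload is not JSON-serializable
    return headers, data
```

The returned pair can be passed on as `requests.post(url, json=data, headers=headers)`, and calling `response.raise_for_status()` afterwards surfaces HTTP errors early.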
Evaluation Results
You can find the MTEB results here.
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
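The last two stages of this pipeline, mean pooling over non-padding tokens followed by L2 normalization, can be illustrated on toy data. The sketch below uses plain Python lists and 2-dimensional vectors as a stand-in for the 768-dimensional token embeddings. `mean_pool` and `l2_normalize` are hypothetical names and do not reproduce the library's implementation.

```python
import math
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Toy token vectors; the third position is padding and must not affect the result.
tokens = [[6.0, 0.0], [0.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # [3.0, 4.0]
embedding = l2_normalize(pooled)      # [0.6, 0.8]
print(pooled, embedding)
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity.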
Citing & Authors