Multilingual-E5-base (sentence-transformers)

This is the sentence-transformers version of the intfloat/multilingual-e5-base model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = ['query: how much protein should a female eat',
             'query: 南瓜的家常做法',
             "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
             "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]


model = SentenceTransformer('embaas/sentence-transformers-multilingual-e5-base')
embeddings = model.encode(sentences)
print(embeddings)
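
Because every input needs a "query: " or "passage: " prefix, it can help to keep that convention in one place. The following is a minimal sketch, not part of sentence-transformers: `with_prefix` and `VALID_PREFIXES` are hypothetical names introduced only for illustration.

```python
from typing import Iterable, List

# The two prefixes named in the usage notes; anything else is rejected.
VALID_PREFIXES = ("query", "passage")


def with_prefix(texts: Iterable[str], kind: str = "query") -> List[str]:
    """Prepend 'query: ' or 'passage: ' to each text (hypothetical helper).

    Texts that already start with the requested prefix are left unchanged.
    """
    if kind not in VALID_PREFIXES:
        raise ValueError(f"kind must be one of {VALID_PREFIXES}, got {kind!r}")
    prefix = f"{kind}: "
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = with_prefix(["how much protein should a female eat"], kind="query")
passages = with_prefix(["This is an example sentence."], kind="passage")
print(queries + passages)
```

The combined list can then be passed to `model.encode(...)` in the same way as the `sentences` list.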

Usage (Huggingface)

import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: 南瓜的家常做法',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"]

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-base')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
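
To make the masked mean pooling in `average_pool` concrete, here is a dependency-free sketch that performs the same computation for a single sequence on plain Python lists. The function name `masked_mean` and the toy numbers are assumptions made for illustration; the real code operates on batched PyTorch tensors.

```python
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    kept = sum(mask)  # number of real (non-padding) tokens
    totals = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        if m:  # padding positions contribute nothing to the sum
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / kept for t in totals]


# Toy sequence: two real tokens followed by one padding token (made-up numbers).
tokens = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(tokens, attention_mask)
print(pooled)  # [2.0, 4.0]
```

The padding vector is excluded from both the sum and the divisor, which is why the large values in the third token do not affect the result.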

Using with API

You can also use the embaas API to encode your input. Get a free API key from embaas.io.

import requests

url = "https://api.embaas.io/v1/embeddings/"

# Replace ${YOUR_API_KEY} with your own key.
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer ${YOUR_API_KEY}"
}

data = {
    "texts": ["This is an example sentence.", "Here is another sentence."],
    "instruction": "query",
    "model": "multilingual-e5-base"
}

response = requests.post(url, json=data, headers=headers)
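
One way to avoid hard-coding the key is to read it from the environment and assemble the request in a small function. This is only a sketch: the payload fields are the ones used in the example, while `build_request` and the `EMBAAS_API_KEY` variable name are assumptions made for illustration, not something the API prescribes.

```python
import os
from typing import Dict, List, Optional, Tuple

EMBAAS_URL = "https://api.embaas.io/v1/embeddings/"


def build_request(
    texts: List[str],
    instruction: str = "query",
    model: str = "multilingual-e5-base",
    api_key: Optional[str] = None,
) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Return (headers, payload) for the embeddings endpoint (hypothetical helper)."""
    # EMBAAS_API_KEY is an assumed environment variable name.
    key = api_key or os.environ.get("EMBAAS_API_KEY", "")
    if not key:
        raise ValueError("no API key given and EMBAAS_API_KEY is not set")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {key}",
    }
    payload = {"texts": list(texts), "instruction": instruction, "model": model}
    return headers, payload


headers, payload = build_request(["This is an example sentence."], api_key="demo-key")
print(payload)
# To send it: requests.post(EMBAAS_URL, json=payload, headers=headers)
```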

Evaluation Results

You can find the MTEB results here.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)
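
Since the final `Normalize()` module returns unit-length vectors, the dot product of two embeddings equals their cosine similarity, which is enough for a simple semantic search. The sketch below illustrates this with made-up 2-dimensional vectors standing in for the 768-dimensional embeddings; `l2_normalize`, `dot`, and `rank_passages` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length, as the Normalize() module does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices ordered from most to least similar."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up stand-ins for real embeddings.
query = l2_normalize([3.0, 4.0])
passages = [l2_normalize(v) for v in ([4.0, 3.0], [3.0, 4.0], [-3.0, -4.0])]

print(rank_passages(query, passages))  # [1, 0, 2]
```

With real data, `query` and `passages` would be rows of the array returned by `model.encode(...)`.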

Citing & Authors
