ember-v1
This model was trained on a large corpus of text pairs spanning many domains, including finance, science, medicine, and law. Training incorporated techniques from the RetroMAE and SetFit research papers.
Plans
- The research paper will be published soon.
- Version 2 of the model is in development and will extend the maximum sequence length to 4,000 tokens.
Usage
Use with transformers:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
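
To make the pooling step concrete, here is a minimal, torch-free sketch of what `average_pool` computes for a single sequence. The helper name `masked_mean` and the toy numbers are illustrative assumptions and are not part of the model's code.

```python
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the vectors whose mask entry is 1; padded positions (mask 0) are ignored."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# Toy input: two real tokens and one padding token (hypothetical values)
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding row does not affect the result, which is why the attention mask is passed to the pooling function.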
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
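
For retrieval-style use, the pairwise comparison above can be extended to ranking several documents against one query. The sketch below is a dependency-free illustration: `cosine` and `rank_documents` are hypothetical helper names, and the toy 2-D vectors stand in for real embeddings such as those returned by `model.encode`.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumed values, not real model output)
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranked = rank_documents(query, docs)
print([i for i, _ in ranked])  # [1, 2, 0]
```

For larger collections, a vectorized computation such as the `cos_sim` utility used above is likely a better fit than this per-pair loop.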
Massive Text Embedding Benchmark (MTEB) Evaluation
Our model achieves state-of-the-art performance on the MTEB leaderboard.
Model Name | Dimension | Sequence Length | Average (56 datasets) |
---|---|---|---|
ember-v1 | 1024 | 512 | 63.54 |
bge-large-en-v1.5 | 1024 | 512 | 63.23 |
bge-base-en-v1.5 | 768 | 512 | 63.05 |
text-embedding-ada-002 | 1536 | 8191 | 60.99 |
Limitations
This model supports English text only. Inputs longer than 512 tokens are truncated.
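
One common workaround for the 512-token limit is to split a long input into overlapping windows and embed each window separately. The sketch below is a generic illustration, not something the model does internally: `split_into_windows` is a hypothetical helper, and the default window of 510 assumes the tokenizer adds two special tokens to each sequence.

```python
from typing import List


def split_into_windows(token_ids: List[int],
                       window: int = 510,
                       overlap: int = 64) -> List[List[int]]:
    """Split a token-id sequence into overlapping windows of at most `window` ids."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + window])
        # Stop once the current window already reaches the end of the input
        if start + window >= len(token_ids):
            break
    return windows


# Stand-in for the ids of a 1,200-token document (assumed values)
ids = list(range(1200))
windows = split_into_windows(ids)
print([len(w) for w in windows])  # [510, 510, 308]
```

In practice the ids could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`, with each window decoded back to text before encoding. How the per-window embeddings are combined (for example, by averaging) depends on the application and is not covered by the benchmark results reported here.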
License
MIT
Citation
@misc{nur2024emberv1,
title={ember-v1: SOTA embedding model},
author={Enrike Nur and Anar Aliyev},
year={2023},
}
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 76.060
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 38.760
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 69.882
- accuracy on MTEB AmazonPolarityClassification, test set (self-reported): 91.977
- ap on MTEB AmazonPolarityClassification, test set (self-reported): 88.635
- f1 on MTEB AmazonPolarityClassification, test set (self-reported): 91.952
- accuracy on MTEB AmazonReviewsClassification (en), test set (self-reported): 47.938
- f1 on MTEB AmazonReviewsClassification (en), test set (self-reported): 47.583
- map_at_1 on MTEB ArguAna, test set (self-reported): 41.252
- map_at_10 on MTEB ArguAna, test set (self-reported): 56.567