Fix typo; update README script + specific MRL snippets

#2
by tomaarsen - opened
Files changed (1)
  1. README.md +133 -32
README.md CHANGED
@@ -2902,23 +2902,25 @@ base_model:
 
 # ModernBERT Embed
 
- ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), brining the new advances of ModernBERT to embeddings!
 
 Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
 
 ## Performance
 
- | Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall/Summ (1) |
- |-------|------------|--------------|--------------------:|-----------------|------------------------|---------------|----------------|-----------|-----------------|
- | nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | 85.2 | 55.7 | 52.8 | 82.1 | 30.1 |
- | nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
- | ModernBERT | 768 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
- | nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34|
- | ModernBERT | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
 
 ## Usage
 
- You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main:
 
 ```bash
 pip install git+https://github.com/huggingface/transformers.git
@@ -2926,7 +2928,59 @@ pip install git+https://github.com/huggingface/transformers.git
 
 Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
 
- Most use cases, adding `search_query` to the query and `search_document` to the documents will be sufficient.
 
 ### Transformers
 
@@ -2935,48 +2989,95 @@ import torch
 import torch.nn.functional as F
 from transformers import AutoTokenizer, AutoModel
 
 def mean_pooling(model_output, attention_mask):
     token_embeddings = model_output[0]
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
 
- tokenizer = AutoTokenizer.from_pretrained('nomic-ai/modernbert-embed')
- model = AutoModel.from_pretrained('nomic-ai/modernbert-embed')
- model.eval()
 
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
- matryoshka_dim = 768
 
 with torch.no_grad():
-     model_output = model(**encoded_input)
 
 
- embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
- embeddings = embeddings[:, :matryoshka_dim]
- embeddings = F.normalize(embeddings, p=2, dim=1)
- print(embeddings)
 ```
 
- ### Sentence Transformers
 
 ```python
- from sentence_transformers import SentenceTransformer
 
- model = SentenceTransformer(
-     "nomic-ai/modernbert-embed",
- )
 
- # Verify that everything works as expected
- embeddings = model.encode(['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'])
- print(embeddings.shape)
 
- similarities = model.similarity(embeddings, embeddings)
 print(similarities)
 ```
 
 
 ## Training
 
 
 
 # ModernBERT Embed
 
+ ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), bringing the new advances of ModernBERT to embeddings!
 
 Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
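As a quick sanity check of the memory figure (a back-of-the-envelope sketch assuming float32 vectors and no other storage overhead; not from the model card):

```python
# Rough arithmetic behind the "reducing memory by 3x" claim (illustrative only):
# a float32 embedding costs 4 bytes per dimension.
full_dim, truncated_dim, bytes_per_float = 768, 256, 4
print(full_dim * bytes_per_float)       # 3072 bytes per full embedding
print(truncated_dim * bytes_per_float)  # 1024 bytes per truncated embedding
print(full_dim / truncated_dim)         # 3.0 -> 3x less memory per stored vector
```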
 
 ## Performance
 
+ | Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall/Summ (1) |
+ |-----------------------|------------|--------------|---------------------|-----------------|-------------------------|---------------|----------------|-----------|------------------|
+ | nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | **85.2** | 55.7 | 52.8 | 82.1 | 30.1 |
+ | nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | **53.01** | **81.94** | 30.4 |
+ | modernbert-embed | 768 | **62.62** | **74.31** | **44.98** | 83.96 | **56.42** | 52.89 | 81.78 | **31.39** |
+ | nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | |
+ | modernbert-embed | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
+
+
 
 ## Usage
 
+ You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing `transformers` from `main`:
 
 ```bash
 pip install git+https://github.com/huggingface/transformers.git
 ```
 
 Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
 
+ For most use cases, adding `search_query: ` to the query and `search_document: ` to the documents will be sufficient.
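If your pipeline stores raw strings, the prefixes can be attached at encode time with a small helper. This is an illustrative sketch; the `with_prefix` name is hypothetical and not part of the model card:

```python
def with_prefix(prefix: str, texts: list[str]) -> list[str]:
    # Prepend a Nomic-style task prefix such as "search_query: " or "search_document: "
    return [prefix + text for text in texts]

queries = with_prefix("search_query: ", ["What is TSNE?", "Who is Laurens van der Maaten?"])
documents = with_prefix("search_document: ", ["TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"])
```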
+
+ ### Sentence Transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("nomic-ai/modernbert-embed")
+
+ query_embeddings = model.encode([
+     "search_query: What is TSNE?",
+     "search_query: Who is Laurens van der Maaten?",
+ ])
+ doc_embeddings = model.encode([
+     "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
+ ])
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # (2, 768) (1, 768)
+
+ similarities = model.similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ # tensor([[0.7214],
+ #         [0.3260]])
+ ```
+
+ <details><summary>Click to see Sentence Transformers usage with Matryoshka Truncation</summary>
+
+ In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the `truncate_dim` parameter when loading the `SentenceTransformer` model.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("nomic-ai/modernbert-embed", truncate_dim=256)
+
+ query_embeddings = model.encode([
+     "search_query: What is TSNE?",
+     "search_query: Who is Laurens van der Maaten?",
+ ])
+ doc_embeddings = model.encode([
+     "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
+ ])
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # (2, 256) (1, 256)
+
+ similarities = model.similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ # tensor([[0.7759],
+ #         [0.3419]])
+ ```
+
+ Note the small differences compared to the full 768-dimensional similarities.
+
+ </details>
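One way to see how little truncation changes the results is to encode once at full size and slice afterwards; assuming the model's similarity function is cosine (the Sentence Transformers default), the sliced scores should line up with the `truncate_dim=256` numbers above. A sketch under those assumptions:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed")

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])

# Keep only the first 256 Matryoshka dimensions after encoding
print(model.similarity(query_embeddings, doc_embeddings))                      # 768-dim scores
print(model.similarity(query_embeddings[:, :256], doc_embeddings[:, :256]))    # 256-dim scores
```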
 
 ### Transformers
 
 ```python
 import torch
 import torch.nn.functional as F
 from transformers import AutoTokenizer, AutoModel
 
+
 def mean_pooling(model_output, attention_mask):
     token_embeddings = model_output[0]
+     input_mask_expanded = (
+         attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     )
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+         input_mask_expanded.sum(1), min=1e-9
+     )
 
 
+ queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
+ documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
 
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
+ model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
 
+ encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
+ encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
 
 with torch.no_grad():
+     queries_outputs = model(**encoded_queries)
+     documents_outputs = model(**encoded_documents)
 
+ query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
+ query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
+ doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
+ doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # torch.Size([2, 768]) torch.Size([1, 768])
 
+ similarities = query_embeddings @ doc_embeddings.T
+ print(similarities)
+ # tensor([[0.7214],
+ #         [0.3260]])
 ```
 
+ <details><summary>Click to see Transformers usage with Matryoshka Truncation</summary>
+
+ In `transformers`, you can truncate embeddings to a smaller dimension by slicing the mean pooled embeddings, prior to normalization.
 
 ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
 
 
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]
+     input_mask_expanded = (
+         attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     )
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+         input_mask_expanded.sum(1), min=1e-9
+     )
+
+
+ queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
+ documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
+
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
+ model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
+ truncate_dim = 256
 
+ encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
+ encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
+
+ with torch.no_grad():
+     queries_outputs = model(**encoded_queries)
+     documents_outputs = model(**encoded_documents)
+
+ query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
+ query_embeddings = query_embeddings[:, :truncate_dim]
+ query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
+ doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
+ doc_embeddings = doc_embeddings[:, :truncate_dim]
+ doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # torch.Size([2, 256]) torch.Size([1, 256])
+
+ similarities = query_embeddings @ doc_embeddings.T
 print(similarities)
+ # tensor([[0.7759],
+ #         [0.3419]])
 ```
 
+ Note the small differences compared to the full 768-dimensional similarities.
+
+ </details>
 
 ## Training