Fix typo; update README script + specific MRL snippets
#2 by tomaarsen (HF staff)
README.md CHANGED
@@ -2902,23 +2902,25 @@ base_model:
|
|
2902 |
|
2903 |
# ModernBERT Embed
|
2904 |
|
2905 |
-
ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base),
|
2906 |
|
2907 |
Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
|
2908 |
|
2909 |
## Performance
|
2910 |
|
2911 |
-
| Model
|
2912 |
-
|
2913 |
-
| nomic-embed-text-v1
|
2914 |
-
| nomic-embed-text-v1.5 | 768
|
2915 |
-
|
|
2916 |
-
| nomic-embed-text-v1.5 | 256
|
2917 |
-
|
2918 |
|
2919 |
## Usage
|
2920 |
|
2921 |
-
You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main
|
2922 |
|
2923 |
```bash
|
2924 |
pip install git+https://github.com/huggingface/transformers.git
|
@@ -2926,7 +2928,59 @@ pip install git+https://github.com/huggingface/transformers.git
|
|
2926 |
|
2927 |
Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
|
2928 |
|
2929 |
-
Most use cases, adding `search_query` to the query and `search_document` to the documents will be sufficient.
|
2930 |
|
2931 |
### Transformers
|
2932 |
|
@@ -2935,48 +2989,95 @@ import torch
|
|
2935 |
import torch.nn.functional as F
|
2936 |
from transformers import AutoTokenizer, AutoModel
|
2937 |
|
|
|
2938 |
def mean_pooling(model_output, attention_mask):
|
2939 |
token_embeddings = model_output[0]
|
2940 |
-
input_mask_expanded =
|
2941 |
-
|
2942 |
|
2943 |
-
sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
|
2944 |
|
2945 |
-
|
2946 |
-
|
2947 |
-
model.eval()
|
2948 |
|
2949 |
-
|
|
|
2950 |
|
2951 |
-
|
|
|
2952 |
|
2953 |
with torch.no_grad():
|
2954 |
-
|
|
|
2955 |
|
2956 |
|
2957 |
-
|
2958 |
-
|
2959 |
-
|
2960 |
-
|
2961 |
```
|
2962 |
|
2963 |
-
|
|
|
|
|
2964 |
|
2965 |
```python
|
2966 |
-
|
|
|
|
|
2967 |
|
2968 |
-
model = SentenceTransformer(
|
2969 |
-
"nomic-ai/modernbert-embed",
|
2970 |
-
)
|
2971 |
|
2972 |
-
|
2973 |
-
|
2974 |
-
|
2975 |
|
2976 |
-
|
2977 |
print(similarities)
|
|
|
|
|
2978 |
```
|
2979 |
|
|
|
|
|
|
|
2980 |
|
2981 |
## Training
|
2982 |
|
|
|
2902 |
|
2903 |
# ModernBERT Embed
|
2904 |
|
2905 |
+
ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), bringing the new advances of ModernBERT to embeddings!
|
2906 |
|
2907 |
Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
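
As a rough illustration of the 3x figure, the sketch below estimates the raw storage for a dense embedding matrix at 768 and 256 dimensions. The corpus size and the float32 storage format are assumptions made for the example, and index overhead is ignored.

```python
# Back-of-the-envelope storage estimate.
# Assumptions (not from the model card): 1M documents, float32 values.
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of a dense embedding matrix, ignoring any index overhead."""
    return num_vectors * dim * bytes_per_value


num_docs = 1_000_000  # hypothetical corpus size
full = index_size_bytes(num_docs, 768)
truncated = index_size_bytes(num_docs, 256)

print(f"768-dim: {full / 1e9:.2f} GB")       # 768-dim: 3.07 GB
print(f"256-dim: {truncated / 1e9:.2f} GB")  # 256-dim: 1.02 GB
print(f"reduction: {full / truncated:.0f}x") # reduction: 3x
```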
|
2908 |
|
2909 |
## Performance
|
2910 |
|
2911 |
+
| Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall/Summ (1) |
|-----------------------|------------|--------------|---------------------|-----------------|-------------------------|---------------|----------------|-----------|------------------|
| nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | **85.2** | 55.7 | 52.8 | 82.1 | 30.1 |
| nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | **53.01** | **81.94** | 30.4 |
| modernbert-embed | 768 | **62.62** | **74.31** | **44.98** | 83.96 | **56.42** | 52.89 | 81.78 | **31.39** |
| nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | |
| modernbert-embed | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
|
2918 |
+
|
2919 |
+
|
2920 |
|
2921 |
## Usage
|
2922 |
|
2923 |
+
You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing `transformers` from `main`:
|
2924 |
|
2925 |
```bash
|
2926 |
pip install git+https://github.com/huggingface/transformers.git
|
|
|
2928 |
|
2929 |
Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
|
2930 |
|
2931 |
+
For most use cases, adding `search_query: ` to the query and `search_document: ` to the documents will be sufficient.
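
One way to apply the prefixes consistently is a small helper like the sketch below. The `add_prefix` function and the constant names are hypothetical and not part of any library; only the two prefix strings come from the text above.

```python
# Hypothetical helper for applying the required task prefixes.
QUERY_PREFIX = "search_query: "
DOCUMENT_PREFIX = "search_document: "


def add_prefix(texts, prefix):
    """Prepend the task prefix to each text, skipping texts that already carry it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(
    ["What is TSNE?", "search_query: Who is Laurens van der Maaten?"],
    QUERY_PREFIX,
)
documents = add_prefix(
    ["TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"],
    DOCUMENT_PREFIX,
)

print(queries)
# ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
```

Checking for an existing prefix keeps already-prefixed inputs from being prefixed twice.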
|
2932 |
+
|
2933 |
+
### Sentence Transformers
|
2934 |
+
|
2935 |
+
```python
|
2936 |
+
from sentence_transformers import SentenceTransformer
|
2937 |
+
|
2938 |
+
model = SentenceTransformer("nomic-ai/modernbert-embed")
|
2939 |
+
|
2940 |
+
query_embeddings = model.encode([
|
2941 |
+
"search_query: What is TSNE?",
|
2942 |
+
"search_query: Who is Laurens van der Maaten?",
|
2943 |
+
])
|
2944 |
+
doc_embeddings = model.encode([
|
2945 |
+
"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
|
2946 |
+
])
|
2947 |
+
print(query_embeddings.shape, doc_embeddings.shape)
|
2948 |
+
# (2, 768) (1, 768)
|
2949 |
+
|
2950 |
+
similarities = model.similarity(query_embeddings, doc_embeddings)
|
2951 |
+
print(similarities)
|
2952 |
+
# tensor([[0.7214],
|
2953 |
+
# [0.3260]])
|
2954 |
+
```
|
2955 |
+
|
2956 |
+
<details><summary>Click to see Sentence Transformers usage with Matryoshka Truncation</summary>
|
2957 |
+
|
2958 |
+
In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the `truncate_dim` parameter when loading the `SentenceTransformer` model.
|
2959 |
+
|
2960 |
+
```python
|
2961 |
+
from sentence_transformers import SentenceTransformer
|
2962 |
+
|
2963 |
+
model = SentenceTransformer("nomic-ai/modernbert-embed", truncate_dim=256)
|
2964 |
+
|
2965 |
+
query_embeddings = model.encode([
|
2966 |
+
"search_query: What is TSNE?",
|
2967 |
+
"search_query: Who is Laurens van der Maaten?",
|
2968 |
+
])
|
2969 |
+
doc_embeddings = model.encode([
|
2970 |
+
"search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
|
2971 |
+
])
|
2972 |
+
print(query_embeddings.shape, doc_embeddings.shape)
|
2973 |
+
# (2, 256) (1, 256)
|
2974 |
+
|
2975 |
+
similarities = model.similarity(query_embeddings, doc_embeddings)
|
2976 |
+
print(similarities)
|
2977 |
+
# tensor([[0.7759],
|
2978 |
+
# [0.3419]])
|
2979 |
+
```
|
2980 |
+
|
2981 |
+
Note the small differences compared to the full 768-dimensional similarities.
|
2982 |
+
|
2983 |
+
</details>
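
To see why truncated similarities differ slightly from the full-dimensional ones, the toy sketch below truncates two vectors and re-normalizes them before taking the dot product. The 4-dimensional vectors are made up for illustration and the helper names are hypothetical; real embeddings have 768 dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values, then re-normalize to unit length."""
    return l2_normalize(vec[:dim])


def cosine(a, b):
    """Cosine similarity as the dot product of unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Made-up toy vectors standing in for a query and a document embedding.
query = [0.6, 0.2, 0.1, 0.3]
doc = [0.5, 0.3, -0.2, 0.1]

full_sim = cosine(query, doc)
trunc_sim = cosine(truncate_and_normalize(query, 2), truncate_and_normalize(doc, 2))
print(full_sim, trunc_sim)
```

The two values differ because the dropped dimensions no longer contribute to either the dot product or the norms.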
|
2984 |
|
2985 |
### Transformers
|
2986 |
|
|
|
2989 |
import torch.nn.functional as F
|
2990 |
from transformers import AutoTokenizer, AutoModel
|
2991 |
|
2992 |
+
|
2993 |
def mean_pooling(model_output, attention_mask):
|
2994 |
token_embeddings = model_output[0]
|
2995 |
+
input_mask_expanded = (
|
2996 |
+
attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
|
2997 |
+
)
|
2998 |
+
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
|
2999 |
+
input_mask_expanded.sum(1), min=1e-9
|
3000 |
+
)
|
3001 |
|
|
|
3002 |
|
3003 |
+
queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
|
3004 |
+
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
|
|
|
3005 |
|
3006 |
+
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
|
3007 |
+
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
|
3008 |
|
3009 |
+
encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
|
3010 |
+
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
|
3011 |
|
3012 |
with torch.no_grad():
|
3013 |
+
queries_outputs = model(**encoded_queries)
|
3014 |
+
documents_outputs = model(**encoded_documents)
|
3015 |
|
3016 |
+
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
|
3017 |
+
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
|
3018 |
+
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
|
3019 |
+
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
|
3020 |
+
print(query_embeddings.shape, doc_embeddings.shape)
|
3021 |
+
# torch.Size([2, 768]) torch.Size([1, 768])
|
3022 |
|
3023 |
+
similarities = query_embeddings @ doc_embeddings.T
|
3024 |
+
print(similarities)
|
3025 |
+
# tensor([[0.7214],
|
3026 |
+
# [0.3260]])
|
3027 |
```
|
3028 |
|
3029 |
+
<details><summary>Click to see Transformers usage with Matryoshka Truncation</summary>
|
3030 |
+
|
3031 |
+
In `transformers`, you can truncate embeddings to a smaller dimension by slicing the mean pooled embeddings, prior to normalization.
|
3032 |
|
3033 |
```python
|
3034 |
+
import torch
|
3035 |
+
import torch.nn.functional as F
|
3036 |
+
from transformers import AutoTokenizer, AutoModel
|
3037 |
|
3038 |
|
3039 |
+
def mean_pooling(model_output, attention_mask):
|
3040 |
+
token_embeddings = model_output[0]
|
3041 |
+
input_mask_expanded = (
|
3042 |
+
attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
|
3043 |
+
)
|
3044 |
+
return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
|
3045 |
+
input_mask_expanded.sum(1), min=1e-9
|
3046 |
+
)
|
3047 |
+
|
3048 |
+
|
3049 |
+
queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
|
3050 |
+
documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
|
3051 |
+
|
3052 |
+
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
|
3053 |
+
model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
|
3054 |
+
truncate_dim = 256
|
3055 |
|
3056 |
+
encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
|
3057 |
+
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
|
3058 |
+
|
3059 |
+
with torch.no_grad():
|
3060 |
+
queries_outputs = model(**encoded_queries)
|
3061 |
+
documents_outputs = model(**encoded_documents)
|
3062 |
+
|
3063 |
+
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
|
3064 |
+
query_embeddings = query_embeddings[:, :truncate_dim]
|
3065 |
+
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
|
3066 |
+
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
|
3067 |
+
doc_embeddings = doc_embeddings[:, :truncate_dim]
|
3068 |
+
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
|
3069 |
+
print(query_embeddings.shape, doc_embeddings.shape)
|
3070 |
+
# torch.Size([2, 256]) torch.Size([1, 256])
|
3071 |
+
|
3072 |
+
similarities = query_embeddings @ doc_embeddings.T
|
3073 |
print(similarities)
|
3074 |
+
# tensor([[0.7759],
|
3075 |
+
# [0.3419]])
|
3076 |
```
|
3077 |
|
3078 |
+
Note the small differences compared to the full 768-dimensional similarities.
|
3079 |
+
|
3080 |
+
</details>
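
The `mean_pooling` function in the snippets above averages token embeddings while ignoring padded positions. A dependency-free toy version, assuming a single sequence and made-up 2-dimensional token vectors, might look like this (`masked_mean` is a hypothetical name):

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors where mask == 1; padded positions (mask == 0) are ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # Guard against division by zero, mirroring the clamp in mean_pooling.
    return [t / max(count, 1e-9) for t in total]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result, which is why the attention mask is passed alongside the model output.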
|
3081 |
|
3082 |
## Training
|
3083 |
|