Fix typo; update README script + specific MRL snippets

#2
by tomaarsen - opened
Files changed (1)
  1. README.md +133 -32
README.md CHANGED
@@ -2902,23 +2902,25 @@ base_model:
 
 # ModernBERT Embed
 
- ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), brining the new advances of ModernBERT to embeddings!
 
 Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
 
 ## Performance
 
- | Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall/Summ (1) |
- |-------|------------|--------------|--------------------:|-----------------|------------------------|---------------|----------------|-----------|-----------------|
- | nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | 85.2 | 55.7 | 52.8 | 82.1 | 30.1 |
- | nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | 53.01 | 81.94 | 30.4 |
- | ModernBERT | 768 | 62.62 | 74.31 | 44.98 | 83.96 | 56.42 | 52.89 | 81.78 | 31.39 |
- | nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34|
- | ModernBERT | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
 
 ## Usage
 
- You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing transformers from main:
 
 ```bash
 pip install git+https://github.com/huggingface/transformers.git
@@ -2926,7 +2928,59 @@ pip install git+https://github.com/huggingface/transformers.git
 
 Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
 
- Most use cases, adding `search_query` to the query and `search_document` to the documents will be sufficient.
 
 ### Transformers
 
@@ -2935,48 +2989,95 @@ import torch
 import torch.nn.functional as F
 from transformers import AutoTokenizer, AutoModel
 
 def mean_pooling(model_output, attention_mask):
     token_embeddings = model_output[0]
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
 
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
 
- tokenizer = AutoTokenizer.from_pretrained('nomic-ai/modernbert-embed')
- model = AutoModel.from_pretrained('nomic-ai/modernbert-embed')
- model.eval()
 
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
- matryoshka_dim = 768
 
 with torch.no_grad():
-     model_output = model(**encoded_input)
 
 
- embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
- embeddings = embeddings[:, :matryoshka_dim]
- embeddings = F.normalize(embeddings, p=2, dim=1)
- print(embeddings)
 ```
 
- ### Sentence Transformers
 
 ```python
- from sentence_transformers import SentenceTransformer
 
- model = SentenceTransformer(
-     "nomic-ai/modernbert-embed",
- )
 
- # Verify that everything works as expected
- embeddings = model.encode(['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?'])
- print(embeddings.shape)
 
- similarities = model.similarity(embeddings, embeddings)
 print(similarities)
 ```
 
 
 ## Training
 
 
 
 # ModernBERT Embed
 
+ ModernBERT Embed is an embedding model trained from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), bringing the new advances of ModernBERT to embeddings!
 
 Trained on the [Nomic Embed](https://arxiv.org/abs/2402.01613) weakly-supervised and supervised datasets, `modernbert-embed` also supports Matryoshka Representation Learning dimensions of 256, reducing memory by 3x with minimal performance loss.
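As a quick sanity check of the memory figure (a back-of-the-envelope sketch assuming float32 vectors and no other storage overhead; not from the model card):

```python
# Rough arithmetic behind the "reducing memory by 3x" claim (illustrative only):
# a float32 embedding costs 4 bytes per dimension.
full_dim, truncated_dim, bytes_per_float = 768, 256, 4
print(full_dim * bytes_per_float)       # 3072 bytes per full embedding
print(truncated_dim * bytes_per_float)  # 1024 bytes per truncated embedding
print(full_dim / truncated_dim)         # 3.0 -> 3x less memory per stored vector
```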
 
 ## Performance
 
+ | Model | Dimensions | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Overall/Summ (1) |
+ |-----------------------|------------|--------------|---------------------|-----------------|-------------------------|---------------|----------------|-----------|------------------|
+ | nomic-embed-text-v1 | 768 | 62.4 | 74.1 | 43.9 | **85.2** | 55.7 | 52.8 | 82.1 | 30.1 |
+ | nomic-embed-text-v1.5 | 768 | 62.28 | 73.55 | 43.93 | 84.61 | 55.78 | **53.01** | **81.94** | 30.4 |
+ | modernbert-embed | 768 | **62.62** | **74.31** | **44.98** | 83.96 | **56.42** | 52.89 | 81.78 | **31.39** |
+ | nomic-embed-text-v1.5 | 256 | 61.04 | 72.1 | 43.16 | 84.09 | 55.18 | 50.81 | 81.34 | |
+ | modernbert-embed | 256 | 61.17 | 72.40 | 43.82 | 83.45 | 55.69 | 50.62 | 81.12 | 31.27 |
+
+
 
 ## Usage
 
+ You can use these models directly with the transformers library. Until the next transformers release, doing so requires installing `transformers` from `main`:
 
 ```bash
 pip install git+https://github.com/huggingface/transformers.git
 ```
 
 Reminder, this model is trained similarly to Nomic Embed and **REQUIRES** prefixes to be added to the input. For more information, see the instructions in [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes).
 
+ For most use cases, adding `search_query: ` to the query and `search_document: ` to the documents will be sufficient.
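If your pipeline stores raw strings, the prefixes can be attached at encode time with a small helper. This is an illustrative sketch; the `with_prefix` name is hypothetical and not part of the model card:

```python
def with_prefix(prefix: str, texts: list[str]) -> list[str]:
    # Prepend a Nomic-style task prefix such as "search_query: " or "search_document: "
    return [prefix + text for text in texts]

queries = with_prefix("search_query: ", ["What is TSNE?", "Who is Laurens van der Maaten?"])
documents = with_prefix("search_document: ", ["TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"])
```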
+
+ ### Sentence Transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("nomic-ai/modernbert-embed")
+
+ query_embeddings = model.encode([
+     "search_query: What is TSNE?",
+     "search_query: Who is Laurens van der Maaten?",
+ ])
+ doc_embeddings = model.encode([
+     "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
+ ])
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # (2, 768) (1, 768)
+
+ similarities = model.similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ # tensor([[0.7214],
+ #         [0.3260]])
+ ```
+
+ <details><summary>Click to see Sentence Transformers usage with Matryoshka Truncation</summary>
+
+ In Sentence Transformers, you can truncate embeddings to a smaller dimension by using the `truncate_dim` parameter when loading the `SentenceTransformer` model.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("nomic-ai/modernbert-embed", truncate_dim=256)
+
+ query_embeddings = model.encode([
+     "search_query: What is TSNE?",
+     "search_query: Who is Laurens van der Maaten?",
+ ])
+ doc_embeddings = model.encode([
+     "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
+ ])
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # (2, 256) (1, 256)
+
+ similarities = model.similarity(query_embeddings, doc_embeddings)
+ print(similarities)
+ # tensor([[0.7759],
+ #         [0.3419]])
+ ```
+
+ Note the small differences compared to the full 768-dimensional similarities.
+
+ </details>
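One way to see how little truncation changes the results is to encode once at full size and slice afterwards; assuming the model's similarity function is cosine (the Sentence Transformers default), the sliced scores should line up with the `truncate_dim=256` numbers above. A sketch under those assumptions:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/modernbert-embed")

query_embeddings = model.encode([
    "search_query: What is TSNE?",
    "search_query: Who is Laurens van der Maaten?",
])
doc_embeddings = model.encode([
    "search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten",
])

# Keep only the first 256 Matryoshka dimensions after encoding
print(model.similarity(query_embeddings, doc_embeddings))                      # 768-dim scores
print(model.similarity(query_embeddings[:, :256], doc_embeddings[:, :256]))    # 256-dim scores
```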
 
 ### Transformers
 
 ```python
 import torch
 import torch.nn.functional as F
 from transformers import AutoTokenizer, AutoModel
 
+
 def mean_pooling(model_output, attention_mask):
     token_embeddings = model_output[0]
+     input_mask_expanded = (
+         attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     )
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+         input_mask_expanded.sum(1), min=1e-9
+     )
 
 
+ queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
+ documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
 
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
+ model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
 
+ encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
+ encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
 
 with torch.no_grad():
+     queries_outputs = model(**encoded_queries)
+     documents_outputs = model(**encoded_documents)
 
+ query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
+ query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
+ doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
+ doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # torch.Size([2, 768]) torch.Size([1, 768])
 
+ similarities = query_embeddings @ doc_embeddings.T
+ print(similarities)
+ # tensor([[0.7214],
+ #         [0.3260]])
 ```
 
+ <details><summary>Click to see Transformers usage with Matryoshka Truncation</summary>
+
+ In `transformers`, you can truncate embeddings to a smaller dimension by slicing the mean pooled embeddings, prior to normalization.
 
 ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
 
 
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]
+     input_mask_expanded = (
+         attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     )
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+         input_mask_expanded.sum(1), min=1e-9
+     )
+
+
+ queries = ["search_query: What is TSNE?", "search_query: Who is Laurens van der Maaten?"]
+ documents = ["search_document: TSNE is a dimensionality reduction algorithm created by Laurens van Der Maaten"]
+
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/modernbert-embed")
+ model = AutoModel.from_pretrained("nomic-ai/modernbert-embed")
+ truncate_dim = 256
 
+ encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
+ encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")
+
+ with torch.no_grad():
+     queries_outputs = model(**encoded_queries)
+     documents_outputs = model(**encoded_documents)
+
+ query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
+ query_embeddings = query_embeddings[:, :truncate_dim]
+ query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
+ doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
+ doc_embeddings = doc_embeddings[:, :truncate_dim]
+ doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
+ print(query_embeddings.shape, doc_embeddings.shape)
+ # torch.Size([2, 256]) torch.Size([1, 256])
+
+ similarities = query_embeddings @ doc_embeddings.T
 print(similarities)
+ # tensor([[0.7759],
+ #         [0.3419]])
 ```
 
+ Note the small differences compared to the full 768-dimensional similarities.
+
+ </details>
 
 ## Training