spacemanidol committed on
Commit d5643c7 · verified · 1 Parent(s): 915c599

Update README.md

Files changed (1)
  1. README.md +22 -33
README.md CHANGED
@@ -2836,10 +2836,10 @@ The models are trained by leveraging existing open-source text representation mo
2836
 
2837
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2838
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2839
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 | 22 | 384 |
2840
  | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 | 33 | 384 |
2841
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 | 110 | 768 |
2842
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 | 137 | 768 |
2843
  | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 | 335 | 1024 |
2844
 
2845
 
@@ -2848,32 +2848,32 @@ Aside from being great open-source models, the largest model, [arctic-embed-l](h
2848
 
2849
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2850
  | ------------------------------------------------------------------ | -------------------------------- |
2851
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2852
  | Google-gecko-text-embedding | 55.7 |
2853
  | text-embedding-3-large | 55.44 |
2854
  | Cohere-embed-english-v3.0 | 55.00 |
2855
  | bge-large-en-v1.5 | 54.29 |
2856
 
2857
 
2858
- ### [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-xs/)
2859
 
2860
 
2861
- This tiny model packs quite the punch based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. With only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
2862
 
2863
 
2864
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2865
  | ------------------------------------------------------------------- | -------------------------------- |
2866
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 |
2867
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2868
  | gte-tiny | 44.92 |
2869
  | all-MiniLM-L6-v2 | 41.95 |
2870
  | bge-micro-v2 | 42.56 |
2871
 
2872
 
2873
- ### Arctic-embed-m
2874
 
2875
 
2876
- Based on the [all-MiniLM-L12-v2](https://huggingface.co/intfloat/e5-base-unsupervised) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
2877
 
2878
 
2879
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
@@ -2885,37 +2885,36 @@ Based on the [all-MiniLM-L12-v2](https://huggingface.co/intfloat/e5-base-unsuper
2885
  | e5-small-v2 | 49.04 |
2886
 
2887
 
2888
- ### [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-m-long/)
2889
 
2890
 
2891
- Based on the [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
2892
 
2893
 
2894
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2895
  | ------------------------------------------------------------------ | -------------------------------- |
2896
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
2897
  | bge-base-en-v1.5 | 53.25 |
2898
- | nomic-embed-text-v1.5 | 53.01 |
2899
  | GIST-Embedding-v0 | 52.31 |
2900
  | gte-base | 52.31 |
2901
 
2902
-
2903
- ### Arctic-embed-m
2904
 
2905
 
2906
- Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
2907
 
2908
 
2909
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2910
  | ------------------------------------------------------------------ | -------------------------------- |
2911
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
2912
- | bge-base-en-v1.5 | 53.25 |
2913
- | nomic-embed-text-v1.5 | 53.25 |
2914
- | GIST-Embedding-v0 | 52.31 |
2915
- | gte-base | 52.31 |
2916
 
2917
 
2918
- ### [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-l/)
2919
 
2920
 
2921
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this small model does not sacrifice retrieval accuracy for its small size.
@@ -2923,7 +2922,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2923
 
2924
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2925
  | ------------------------------------------------------------------ | -------------------------------- |
2926
- | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2927
  | UAE-Large-V1 | 54.66 |
2928
  | bge-large-en-v1.5 | 54.29 |
2929
  | mxbai-embed-large-v1 | 54.39 |
@@ -2977,14 +2976,6 @@ for query, query_scores in zip(queries, scores):
2977
  ```
2978
 
2979
 
2980
- If you use the long context model with more than 2048 tokens, ensure that you initialize the model like below instead. This will use [RPE](https://arxiv.org/abs/2104.09864) to allow up to 8192 tokens.
2981
-
2982
-
2983
- ``` py
2984
- model = AutoModel.from_pretrained('Snowflake/arctic-embed-m-long', trust_remote_code=True, rotary_scaling_factor=2)
2985
- ```
2986
-
2987
-
2988
  ## FAQ
2989
 
2990
 
@@ -3013,5 +3004,3 @@ We thank our leadership, Himabindu Pucha, Kelvin So, Vivek Raghunathan, and Srid
3013
  We also thank the open-source community for producing the great models we could build on top of and making these releases possible.
3014
  Finally, we thank the researchers who created BEIR and MTEB benchmarks.
3015
  It is largely thanks to their tireless work to define what better looks like that we could improve model performance.
3016
-
3017
-
 
2836
 
2837
  | Name | MTEB Retrieval Score (NDCG @ 10) | Parameters (Millions) | Embedding Dimension |
2838
  | ----------------------------------------------------------------------- | -------------------------------- | --------------------- | ------------------- |
2839
+ | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 | 22 | 384 |
2840
  | [arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s/) | 51.98 | 33 | 384 |
2841
+ | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 | 110 | 768 |
2842
+ | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 | 137 | 768 |
2843
  | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 | 335 | 1024 |
2844
 
2845
 
 
2848
 
2849
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2850
  | ------------------------------------------------------------------ | -------------------------------- |
2851
+ | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2852
  | Google-gecko-text-embedding | 55.7 |
2853
  | text-embedding-3-large | 55.44 |
2854
  | Cohere-embed-english-v3.0 | 55.00 |
2855
  | bge-large-en-v1.5 | 54.29 |
2856
 
2857
 
2858
+ ### [Arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs)
2859
 
2860
 
2861
+ This tiny model packs quite the punch. Based on the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model with only 22m parameters and 384 dimensions, this model should meet even the strictest latency/TCO budgets. Despite its size, its retrieval accuracy is closer to that of models with 100m parameters.
2862
 
2863
 
2864
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2865
  | ------------------------------------------------------------------- | -------------------------------- |
2866
+ | [arctic-embed-xs](https://huggingface.co/Snowflake/arctic-embed-xs/) | 50.15 |
2867
  | GIST-all-MiniLM-L6-v2 | 45.12 |
2868
  | gte-tiny | 44.92 |
2869
  | all-MiniLM-L6-v2 | 41.95 |
2870
  | bge-micro-v2 | 42.56 |
2871
 
2872
 
2873
+ ### [Arctic-embed-s](https://huggingface.co/Snowflake/arctic-embed-s)
2874
 
2875
 
2876
+ Based on the [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) model, this small model does not trade off retrieval accuracy for its small size. With only 33m parameters and 384 dimensions, this model should easily allow scaling to large datasets.
2877
 
2878
 
2879
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
 
2885
  | e5-small-v2 | 49.04 |
2886
 
2887
 
2888
+ ### [Arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/)
2889
 
2890
 
2891
+ Based on the [intfloat/e5-base-unsupervised](https://huggingface.co/intfloat/e5-base-unsupervised) model, this medium model is the workhorse that provides the best retrieval performance without slowing down inference.
2892
 
2893
 
2894
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2895
  | ------------------------------------------------------------------ | -------------------------------- |
2896
+ | [arctic-embed-m](https://huggingface.co/Snowflake/arctic-embed-m/) | 54.90 |
2897
  | bge-base-en-v1.5 | 53.25 |
2898
+ | nomic-embed-text-v1.5 | 53.25 |
2899
  | GIST-Embedding-v0 | 52.31 |
2900
  | gte-base | 52.31 |
2901
 
2902
+ ### [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/)
 
2903
 
2904
 
2905
+ Based on the [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model, this long-context variant of our medium-sized model is perfect for workloads that can be constrained by the regular 512 token context of our other models. Without the use of RPE, this model supports up to 2048 tokens. With RPE, it can scale to 8192!
2906
 
2907
 
2908
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2909
  | ------------------------------------------------------------------ | -------------------------------- |
2910
+ | [arctic-embed-m-long](https://huggingface.co/Snowflake/arctic-embed-m-long/) | 54.83 |
2911
+ | nomic-embed-text-v1.5 | 53.01 |
2912
+ | nomic-embed-text-v1 | 52.81 |
2913
+
2914
+
2915
 
2916
 
2917
+ ### [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/)
2918
 
2919
 
2920
  Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised) model, this small model does not sacrifice retrieval accuracy for its small size.
 
2922
 
2923
  | Model Name | MTEB Retrieval Score (NDCG @ 10) |
2924
  | ------------------------------------------------------------------ | -------------------------------- |
2925
+ | [arctic-embed-l](https://huggingface.co/Snowflake/arctic-embed-l/) | 55.98 |
2926
  | UAE-Large-V1 | 54.66 |
2927
  | bge-large-en-v1.5 | 54.29 |
2928
  | mxbai-embed-large-v1 | 54.39 |
 
2976
  ```
2977
 
2978
 
 
 
 
 
 
 
 
 
2979
  ## FAQ
2980
 
2981
 
 
3004
  We also thank the open-source community for producing the great models we could build on top of and making these releases possible.
3005
  Finally, we thank the researchers who created BEIR and MTEB benchmarks.
3006
  It is largely thanks to their tireless work to define what better looks like that we could improve model performance.
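
For readers unfamiliar with the metric in the tables above, the following is a minimal sketch of how NDCG @ 10 can be computed for a single query. It is not the MTEB or BEIR evaluation code: it assumes linear gain (the relevance label itself), a log2 rank discount, and a hypothetical list of relevance labels. The function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain assumed)."""
    # Index i is 0-based, so the result at rank i + 1 is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked: Sequence[float],
    judged: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked: relevance labels of the retrieved documents, in ranked order.
    judged: relevance labels of all judged documents for the query; if omitted,
            the ideal ranking is built from the retrieved documents only.
    """
    pool = ranked if judged is None else judged
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents, so the score is defined as 0 here
    return dcg_at_k(ranked, k) / ideal_dcg


# Hypothetical query: relevant documents were retrieved at ranks 2 and 4.
example_ranking = [0, 1, 0, 1]
print(f"NDCG@10 = {100 * ndcg_at_k(example_ranking):.2f}")
```

The final line scales the score by 100 on the assumption that the table values are percentages; a benchmark score would then be the mean of this per-query value over all queries and datasets.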