spacemanidol committed on
Commit
915c599
·
verified ·
1 Parent(s): b30c8a2

Update README.md

Files changed (1)
  1. README.md +11 -7
README.md CHANGED
@@ -2819,7 +2819,7 @@ license: apache-2.0
2819
  ## News
2820
 
2821
 
2822
- 04/16/2024: Release the **Arctic-text-embed** family of text embedding models. The releases are state-of-the-art for retrieval quality at each of their representative size profiles. [Technical Report]() is coming shortly. For more details, please refer to our GitHub: [Arctic-Text-Embed](https://github.com/Snowflake/Arctic-Text-Embed).
2823
 
2824
 
2825
  ## Models
@@ -2828,7 +2828,7 @@ license: apache-2.0
2828
  Arctic-Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2829
 
2830
 
2831
- The `arctic-text-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy when compared to other top models.
2832
 
2833
 
2834
  The models start from existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch; this pretraining stage uses about 400M samples from a mix of public datasets and proprietary web search data. Following pretraining, the models are further optimized with long training on a smaller dataset (about 1M samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
@@ -2936,7 +2936,7 @@ Based on the [intfloat/e5-large-unsupervised](https://huggingface.co/intfloat/e5
2936
  ### Using Huggingface transformers
2937
 
2938
 
2939
- To use an arctic-embed model, you can use the transformers package, as shown below. For optimal retrieval quality, ensure that you use the CLS token as the embedding for each portion of text and use the query prefix below (just on the query).
2940
 
2941
 
2942
 
@@ -2977,7 +2977,7 @@ for query, query_scores in zip(queries, scores):
2977
  ```
2978
 
2979
 
2980
- If you use the long context model and have more than 2048 tokens, ensure that you initialize the model like below instead. This will use [RPE](https://arxiv.org/abs/2104.09864) to allow up to 8192 tokens.
2981
 
2982
 
2983
  ``` py
@@ -2994,7 +2994,7 @@ TBD
2994
  ## Contact
2995
 
2996
 
2997
- If you have any questions or suggestions about this project, feel free to open an issue or pull request.
2998
  You can also email Daniel Campos ([email protected]).
2999
 
3000
 
@@ -3007,7 +3007,11 @@ Arctic is licensed under the [Apache-2](https://www.apache.org/licenses/LICENSE-
3007
  ## Acknowledgement
3008
 
3009
 
3010
- We would like to thank the open-source community, which has provided the great building blocks upon which we could make our models.
3011
-
 
 
 
 
3012
 
3013
 
 
2819
  ## News
2820
 
2821
 
2822
+ 04/16/2024: Release the **Arctic-embed** family of text embedding models. The releases are state-of-the-art for retrieval quality at each of their representative size profiles. [Technical Report]() is coming shortly. For more details, please refer to our GitHub: [Arctic-Text-Embed](https://github.com/Snowflake/Arctic-Text-Embed).
2823
 
2824
 
2825
  ## Models
 
2828
  Arctic-Embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.
2829
 
2830
 
2831
+ The `arctic-embedding` models achieve **state-of-the-art performance on the MTEB/BEIR leaderboard** for each of their size variants. Evaluation is performed using these [scripts](https://github.com/Snowflake-Labs/arctic-embed/tree/main/src). As shown below, each class of model size achieves SOTA retrieval accuracy compared to other top models.
2832
 
2833
 
2834
  The models start from existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are trained with large batches of query-document pairs where negatives are derived in-batch; this pretraining stage uses about 400M samples from a mix of public datasets and proprietary web search data. Following pretraining, the models are further optimized with long training on a smaller dataset (about 1M samples) of triplets of query, positive document, and negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
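
The first training stage described above (query-document pairs with in-batch negatives) can be illustrated with a small sketch. The README does not state the exact loss, so this assumes a standard InfoNCE-style contrastive objective over cosine similarities; the function name `in_batch_contrastive_loss` and the `temperature` value are hypothetical choices for illustration, not the authors' settings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """Assumed InfoNCE-style loss: doc_vecs[i] is the positive for query_vecs[i],
    and every other document in the batch serves as a negative.
    The temperature default is an illustrative assumption."""
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, query in enumerate(queries):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(query, doc)) / temperature for doc in docs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matching document as the target class.
        total += log_denominator - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_docs = [[0.0, 1.0], [1.0, 0.0]]
    print(in_batch_contrastive_loss(queries, aligned_docs))
    print(in_batch_contrastive_loss(queries, shuffled_docs))
```

The loss is small when each query is closest to its own document and large when it is closer to another document in the batch. Under the same assumption, the second stage would add the mined hard negative for each query as an extra candidate in the denominator.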
 
2936
  ### Using Huggingface transformers
2937
 
2938
 
2939
+ You can use the transformers package to load an arctic-embed model, as shown below. For optimal retrieval quality, use the CLS token as the embedding for each piece of text and apply the query prefix below to queries only.
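
As a rough illustration of the two recommendations above (CLS pooling and a query-only prefix), the sketch below works on plain Python lists standing in for model outputs, so it runs without downloading a model. `QUERY_PREFIX` is a placeholder rather than the real prefix string, the helper names are hypothetical, and the L2 normalization before scoring is an assumption (cosine similarity).

```python
import math

# Placeholder only: substitute the query prefix given in the README example.
QUERY_PREFIX = "<query prefix>: "


def add_query_prefix(queries, prefix=QUERY_PREFIX):
    """Prepend the prefix to queries only; documents are embedded unchanged."""
    return [prefix + query for query in queries]


def cls_pool(token_embeddings):
    """Take the first (CLS) token vector of each text and L2-normalize it.

    token_embeddings is shaped [batch][tokens][hidden], mimicking a
    transformer's last hidden state.
    """
    pooled = []
    for tokens in token_embeddings:
        cls_vector = tokens[0]
        norm = math.sqrt(sum(x * x for x in cls_vector))
        pooled.append([x / norm for x in cls_vector])
    return pooled


def score(query_embeddings, doc_embeddings):
    """Dot products of normalized embeddings, i.e. cosine similarities."""
    return [
        [sum(a * b for a, b in zip(query, doc)) for doc in doc_embeddings]
        for query in query_embeddings
    ]


if __name__ == "__main__":
    # Fake hidden states: one query and two documents, two tokens each.
    query_states = [[[3.0, 4.0], [9.0, 9.0]]]
    doc_states = [[[6.0, 8.0], [1.0, 1.0]], [[4.0, -3.0], [1.0, 1.0]]]
    print(add_query_prefix(["what is snowflake?"]))
    print(score(cls_pool(query_states), cls_pool(doc_states)))
```

Only the first token's vector contributes to each embedding; the remaining token vectors are ignored, which is what distinguishes CLS pooling from mean pooling.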
2940
 
2941
 
2942
 
 
2977
  ```
2978
 
2979
 
2980
+ If you use the long context model with more than 2048 tokens, ensure that you initialize the model like below instead. This will use [RPE](https://arxiv.org/abs/2104.09864) to allow up to 8192 tokens.
2981
 
2982
 
2983
  ``` py
 
2994
  ## Contact
2995
 
2996
 
2997
+ Feel free to open an issue or pull request if you have any questions or suggestions about this project.
2998
  You can also email Daniel Campos ([email protected]).
2999
 
3000
 
 
3007
  ## Acknowledgement
3008
 
3009
 
3010
+ We want to thank the open-source community, which has provided the great building blocks upon which we could make our models.
3011
+ We thank our modeling engineers, Danmei Xu, Luke Merrick, Gaurav Nuti, and Daniel Campos, for making these great models possible.
3012
+ We thank our leadership, Himabindu Pucha, Kelvin So, Vivek Raghunathan, and Sridhar Ramaswamy, for supporting this work.
3013
+ We also thank the open-source community for producing the great models we could build on top of and making these releases possible.
3014
+ Finally, we thank the researchers who created BEIR and MTEB benchmarks.
3015
+ It is largely thanks to their tireless work to define what better looks like that we could improve model performance.
3016
 
3017