zpn MaxNomic committed
Commit b53d557 · verified · 1 Parent(s): 1d03a35

remove details about v1 from other checkpoint (#4)


- remove details about v1 from other checkpoint (869be4070611ad5b66a9349cdcfd72040ac5813e)


Co-authored-by: Max Cembalest <[email protected]>

Files changed (1)
  1. README.md +2 -102
README.md CHANGED
@@ -2612,110 +2612,10 @@ model-index:
  # nomic-embed-text-v1-unsupervised: A Reproducible Long Context (8192) Text Embedder

  `nomic-embed-text-v1-unsupervised` is 8192 context length text encoder. This is a checkpoint after contrastive pretraining from multi-stage contrastive training of the
- [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). If you want to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1)
- .


- | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
- | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
- | nomic-embed-text-v1 | 8192 | **62.39** |**85.53** | 54.16 | ✅ | ✅ | ✅ |
- | jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | ✅ | ❌ | ❌ |
- | text-embedding-3-small | 8191 | 62.26 | 82.40 | **58.20** | ❌ | ❌ | ❌ |
- | text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | ❌ | ❌ | ❌ |
-
-
- If you would like to finetune a model on more data, you can use this model as an initialization
-
- ## Hosted Inference API
-
- The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
-
- Generating embeddings with the `nomic` Python client is as easy as
-
- ```python
- from nomic import embed
-
- output = embed.text(
-     texts=['Nomic Embedding API', '#keepAIOpen'],
-     model='nomic-embed-text-v1',
-     task_type='search_document'
- )
-
- print(output)
- ```
-
- For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text)
-
- ## Data Visualization
- Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
-
-
- [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
-
-
- ## Training Details
-
- We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
- the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
-
- In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining is crucial in this stage.
-
- For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
-
- Training data to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors)
-
- ## Usage
-
- Note `nomic-embed-text` requires prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`.
- For retrieval applications, you should prepend `search_document` for all your documents and `search_query` for your queries.
-
- ### Sentence Transformers
- ```python
- from sentence_transformers import SentenceTransformer
-
- model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-unsupervised", trust_remote_code=True)
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
-
- ### Transformers
- ```python
- import torch
- import torch.nn.functional as F
- from transformers import AutoTokenizer, AutoModel
-
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
-
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
-
- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
- model.eval()
-
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
- embeddings = F.normalize(embeddings, p=2, dim=1)
- print(embeddings)
- ```
-
- The model natively supports scaling of the sequence length past 2048 tokens. To do so,
-
- ```diff
- - tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
- + tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
-
-
- - model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True)
- + model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-unsupervised', trust_remote_code=True, rotary_scaling_factor=2)
- ```
+ [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). The purpose of releasing this checkpoint is to open-source training artifacts from our Nomic Embed Text tech report [here](https://arxiv.org/pdf/2402.01613)

+ If you want to use a model to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).


  # Join the Nomic Community

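
Reviewer note: the Usage text removed in this commit relies on two conventions: every input carries a task prefix (`search_query`, `search_document`, `classification`, or `clustering`), and the Transformers example mean-pools token embeddings over non-padding positions before L2-normalizing. The following is a minimal, dependency-free sketch of both steps, not the model's own code. The helper names `add_prefix`, `masked_mean_pool`, and `l2_normalize` are hypothetical, the `prefix: text` format is inferred from the example strings in the removed README, and the token vectors are toy values chosen for illustration.

```python
import math

# Task prefixes listed in the removed README text.
TASK_PREFIXES = ("search_query", "search_document", "classification", "clustering")


def add_prefix(texts, task):
    """Prepend '<task>: ' to each text (hypothetical helper, not a library API)."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task prefix: {task!r}")
    return [f"{task}: {text}" for text in texts]


def masked_mean_pool(token_embeddings, attention_mask):
    """Average one sequence's token vectors where the mask is 1, skipping padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
    # Guard against an all-zero mask, mirroring torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in total]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Queries and documents get different prefixes in a retrieval setting.
queries = add_prefix(["What is TSNE?"], "search_query")
documents = add_prefix(["TSNE is a dimensionality reduction method."], "search_document")

# Toy token vectors for one sequence; the last position is padding (mask = 0).
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)   # padding row is ignored
embedding = l2_normalize(pooled)          # unit-length sentence embedding
```

In a real pipeline the prefixed strings would be passed to the tokenizer and model, and the pooling would run on the model's batched output tensors as in the removed Transformers example.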