zpn MaxNomic committed on
Commit 7d94890 · verified · 1 Parent(s): e4af278

remove main model info (#4)

- remove main model info (33f5ac5a100e0532159bac2694046db258151bd2)


Co-authored-by: Max Cembalest <[email protected]>

Files changed (1)
  1. README.md +2 -102
README.md CHANGED
@@ -2604,110 +2604,10 @@ model-index:
2604
 
2605
  # nomic-embed-text-v1-ablated: A Reproducible Long Context (8192) Text Embedder
2606
 
2607
- `nomic-embed-text-v1-ablated` is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 performance on short and long tasks.
2608
- .
2609
 
 
2610
 
2611
- | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
2612
- | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
2613
- | nomic-embed-text-v1 | 8192 | **62.39** |**85.53** | 54.16 | ✅ | ✅ | ✅ |
2614
- | jina-embeddings-v2-base-en | 8192 | 60.39 | 85.45 | 51.90 | ✅ | ❌ | ❌ |
2615
- | text-embedding-3-small | 8191 | 62.26 | 82.40 | **58.20** | ❌ | ❌ | ❌ |
2616
- | text-embedding-ada-002 | 8191 | 60.99 | 52.7 | 55.25 | ❌ | ❌ | ❌ |
2617
-
2618
-
2619
- If you would like to finetune a model on more data, you can use this model as an initialization
2620
-
2621
- ## Hosted Inference API
2622
-
2623
- The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
2624
-
2625
- Generating embeddings with the `nomic` Python client is as easy as
2626
-
2627
- ```python
2628
- from nomic import embed
2629
-
2630
- output = embed.text(
2631
-     texts=['Nomic Embedding API', '#keepAIOpen'],
2632
-     model='nomic-embed-text-v1',
2633
-     task_type='search_document'
2634
- )
2635
-
2636
- print(output)
2637
- ```
2638
-
2639
- For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text)
2640
-
2641
- ## Data Visualization
2642
- Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
2643
-
2644
-
2645
- [![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
2646
-
2647
- ## Training Details
2648
-
2649
- We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
2650
- the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
2651
-
2652
- In the second finetuning stage, higher quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining is crucial in this stage.
2653
-
2654
- For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
2655
-
2656
- Training data to train the models is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors)
2657
-
2658
- ## Usage
2659
-
2660
- Note `nomic-embed-text` requires prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`.
2661
- For retrieval applications, you should prepend `search_document` for all your documents and `search_query` for your queries.
2662
-
2663
- ### Sentence Transformers
2664
- ```python
2665
- from sentence_transformers import SentenceTransformer
2666
-
2667
- model = SentenceTransformer("nomic-ai/nomic-embed-text-v1-ablated", trust_remote_code=True)
2668
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
2669
- embeddings = model.encode(sentences)
2670
- print(embeddings)
2671
- ```
2672
-
2673
- ### Transformers
2674
-
2675
- ```python
2676
- import torch
2677
- import torch.nn.functional as F
2678
- from transformers import AutoTokenizer, AutoModel
2679
-
2680
- def mean_pooling(model_output, attention_mask):
2681
-     token_embeddings = model_output[0]
2682
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
2683
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
2684
-
2685
- sentences = ['search_query: What is TSNE?', 'search_query: Who is Laurens van der Maaten?']
2686
-
2687
- tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
2688
- model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True)
2689
- model.eval()
2690
-
2691
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
2692
-
2693
- with torch.no_grad():
2694
-     model_output = model(**encoded_input)
2695
-
2696
- embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
2697
- embeddings = F.normalize(embeddings, p=2, dim=1)
2698
- print(embeddings)
2699
- ```
2700
-
2701
- The model natively supports scaling of the sequence length past 2048 tokens. To do so,
2702
-
2703
- ```diff
2704
- - tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
2705
- + tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=8192)
2706
-
2707
-
2708
- - model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True)
2709
- + model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1-ablated', trust_remote_code=True, rotary_scaling_factor=2)
2710
- ```
2711
 
2712
  # Join the Nomic Community
2713
 
 
2604
 
2605
  # nomic-embed-text-v1-ablated: A Reproducible Long Context (8192) Text Embedder
2606
 
2607
+ `nomic-embed-text-v1-ablated` is an 8192 context length text encoder. This checkpoint was trained on a modified version of the dataset used to train our [final model](https://huggingface.co/nomic-ai/nomic-embed-text-v1). We are releasing it to help understand the impact that subsets of our training data had on model outcomes. This release is part of our commitment to open-source the training artifacts from our Nomic Embed Text tech report [here](https://arxiv.org/pdf/2402.01613).
 
2608
 
2609
+ If you want to use a model to extract embeddings, we suggest using [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1).
2610
 
2611
 
2612
  # Join the Nomic Community
2613