Supposed to be the same or better than v1?
Hi there,
Just wanted to flag that it's worth experimenting with v1 and v1.5 side by side before committing to either.
On my information retrieval tasks & eval data, nomic-embed-text-v1 shows MUCH better performance than v1.5.
- nomic-v1: MRR@10 = 0.41
- nomic-v1.5: MRR@10 = 0.27
The other metrics reported by InformationRetrievalEvaluator show a similar gap; MRR@10 just happens to be my primary metric.
I'm not sure what the reason might be, but I figured it's worth sharing with the community!
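If you want to reproduce a quick side-by-side yourself, here's a minimal sketch (the model IDs, the `trust_remote_code` flag, and the task prefixes come from the Hugging Face model cards; the query and document text are just placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Nomic models expect a task prefix on every input.
query = "search_query: b2b marketing automation platforms"
doc = "search_document: # Acme Inc\nMarketing automation built for B2B teams..."

for name in ("nomic-ai/nomic-embed-text-v1", "nomic-ai/nomic-embed-text-v1.5"):
    model = SentenceTransformer(name, trust_remote_code=True)
    q_emb, d_emb = model.encode([query, doc], convert_to_tensor=True)
    print(name, util.cos_sim(q_emb, d_emb).item())
```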
P.S. My documents are rather long (Markdown webpages); that's why I turned to nomic in the first place.
Hm that's interesting! Would you mind sharing how you eval'd the model and what the data looks like?
Sure, I'm using sentence-transformers to fine-tune and evaluate.
Here's a snippet:
```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

evaluator = InformationRetrievalEvaluator(
    hard_queries,        # query_id -> query text
    hard_corpus,         # doc_id -> document text
    hard_relevant_docs,  # query_id -> set of relevant doc_ids
    corpus_chunk_size=BATCH_SIZE,
    batch_size=BATCH_SIZE,
    show_progress_bar=True,
    query_prompt="search_query: ",      # nomic task prefixes
    corpus_prompt="search_document: ",
)
```
I'm working with search queries like "b2b marketing automation platforms" and documents scraped from company websites, formatted as Markdown (median sequence length ~1800 tokens).
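And in case the data shapes help, here's a self-contained toy version of the setup (all IDs and texts below are made up; the metric names in the returned dict depend on your sentence-transformers version, so I just print everything):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Hypothetical stand-ins for the real data:
hard_queries = {"q1": "b2b marketing automation platforms"}
hard_corpus = {
    "d1": "# Acme Inc\nMarketing automation built for B2B teams...",
    "d2": "# Beta Corp\nErgonomic office furniture for startups...",
}
hard_relevant_docs = {"q1": {"d1"}}  # query_id -> set of relevant doc_ids

evaluator = InformationRetrievalEvaluator(
    hard_queries,
    hard_corpus,
    hard_relevant_docs,
    query_prompt="search_query: ",
    corpus_prompt="search_document: ",
)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
results = evaluator(model)  # dict of metrics, including MRR@10
for metric, value in results.items():
    print(metric, value)
```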