Supposed to be same or better than v1?

#44
by persijano - opened

Hi there,

Just wanted to flag that it's worth experimenting with v1 and v1.5 side by side and comparing results before committing to one.

On my information retrieval tasks & eval data, nomic-embed-text-v1 shows MUCH better performance than v1.5.

  • nomic-v1: MRR@10 = 0.41
  • nomic-v1.5: MRR@10 = 0.27

Other metrics reported by InformationRetrievalEvaluator show a similar gap; MRR is just my primary one.
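For reference, MRR@10 scores each query by the reciprocal rank of the first relevant document among the top 10 hits, then averages over queries. A minimal stand-alone sketch (function and variable names are illustrative, not from the poster's pipeline):

```python
def mrr_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top-k hits, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_mrr_at_k(runs, k=10):
    """Average MRR@k over queries; each run is (ranked ids, set of relevant ids)."""
    return sum(mrr_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# e.g. first relevant doc at rank 2 for one query, rank 1 for another:
# mean_mrr_at_k([(["a", "b"], {"b"}), (["c"], {"c"})])  ->  0.75
```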

I'm not sure what the reason might be, but I figured it's worth sharing with the community!

P.S. My documents are rather long (Markdown webpages), which is why I turned to nomic in the first place.

Nomic AI org

Hm that's interesting! Would you mind sharing how you eval'd the model and what the data looks like?

Sure, I'm using sentence-transformers to finetune / evaluate.

Here's a snippet:

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

evaluator = InformationRetrievalEvaluator(
    hard_queries,
    hard_corpus,
    hard_relevant_docs,
    corpus_chunk_size=BATCH_SIZE,
    batch_size=BATCH_SIZE,
    show_progress_bar=True,
    query_prompt="search_query: ",
    corpus_prompt="search_document: ",
)
```

I'm working with search queries like `b2b marketing automation platforms`, and documents scraped from company websites and formatted as Markdown (median sequence length ~1800 tokens).
