wangchuan98 committed on
Commit 3f02298
1 Parent(s): 74ba4aa

Update README.md

Files changed (1)
  1. README.md +52 -1
README.md CHANGED
@@ -558,4 +558,55 @@ model-index:
  value: 85.9
  ---
  license: apache-2.0
- ---
+ ---
+
+ # Model Introduction
+ 360Zhinao-search uses a self-developed BERT model as its base and is fine-tuned on multiple tasks. It reaches an average score of 75.05 on the Retrieval tasks of the C-MTEB-Retrieval benchmark, currently ranking first.
+ The [C-MTEB-Retrieval leaderboard](https://huggingface.co/spaces/mteb/leaderboard) contains 8 [query, passage] similarity retrieval subtasks from different domains, evaluated with NDCG@10 (Normalized Discounted Cumulative Gain at 10).
+
+ | Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
+ |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
+ |**360Zhinao-search** | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | **75.05** |
+ |AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
+ |OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
+ |piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
+ |stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |
+
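+ The scores above are NDCG@10 values. As a minimal illustrative sketch only (not the official evaluation code, which is the C-MTEB test script linked under Reference), NDCG@10 for a single query can be computed roughly like this:
+
+ ```python
+ import numpy as np
+
+ def ndcg_at_10(relevances):
+     """relevances: graded relevance labels of the returned passages, in ranked order.
+     For simplicity, the ideal ranking is built from these labels only."""
+     rels = np.asarray(relevances, dtype=float)[:10]
+     discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))   # 1 / log2(rank + 1)
+     dcg = float(np.sum(rels * discounts))
+     idcg = float(np.sum(np.sort(rels)[::-1] * discounts))    # best possible ordering
+     return dcg / idcg if idcg > 0 else 0.0
+
+ # Example: the single relevant passage is ranked 3rd out of 10
+ print(ndcg_at_10([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
+ ```
+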
+ ## Optimization points
+ 1. Data filtering: strictly prevent C-MTEB-Retrieval test data from leaking into training by cleaning all test-set queries and passages out of the training data;
+ 2. Data source enhancement: use open-source data and LLM-synthesized data to improve data diversity;
+ 3. Negative example mining: use multiple methods to mine hard negatives that are difficult to distinguish from positives, improving the information gain per batch (an illustrative sketch follows this list);
+ 4. Training efficiency: multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
+
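+ The following is an illustrative sketch only, not the actual mining pipeline (whose details are not described here); `top_k` and `num_negs` are made-up values. The idea is to rank the corpus with the current embedding model and keep high-scoring passages that are not labeled positive for the query:
+
+ ```python
+ import numpy as np
+
+ def mine_hard_negatives(query_embs, passage_embs, positives, top_k=30, num_negs=7):
+     """query_embs, passage_embs: L2-normalized embeddings, so dot product = cosine similarity.
+     positives[qid]: set of passage ids labeled relevant for query qid."""
+     scores = query_embs @ passage_embs.T             # (num_queries, num_passages)
+     hard_negatives = []
+     for qid in range(scores.shape[0]):
+         ranked = np.argsort(-scores[qid])[:top_k]    # top-k candidates for this query
+         negs = [int(pid) for pid in ranked if pid not in positives[qid]]
+         hard_negatives.append(negs[:num_negs])       # hardest non-positive passages
+     return hard_negatives
+ ```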
+
+ ## Usage
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load the 360Zhinao-search tokenizer and encoder
+ tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
+ model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')
+
+ sentences = ['天空是什么颜色的', '天空是蓝色的']
+ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ if __name__ == "__main__":
+     with torch.no_grad():
+         last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
+         # Sentence embedding = L2-normalized [CLS] token representation
+         embeddings = last_hidden_state[:, 0]
+         embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
+         embeddings = embeddings.cpu().numpy()
+
+     print("embeddings:")
+     print(embeddings)
+
+     # Embeddings are normalized, so the dot product is the cosine similarity
+     cos_sim = np.dot(embeddings[0], embeddings[1])
+     print("cos_sim:", cos_sim)
+ ```
+
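+ For retrieval, the same embeddings can rank candidate passages for a query by cosine similarity. This is only a sketch built on the snippet above; the `encode` helper is introduced here for illustration and is not part of the model's API:
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
+ model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')
+
+ def encode(texts):
+     # Same recipe as above: normalized [CLS] embeddings
+     inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
+     with torch.no_grad():
+         last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
+     return torch.nn.functional.normalize(last_hidden_state[:, 0], dim=-1).cpu().numpy()
+
+ query = '天空是什么颜色的'
+ passages = ['天空是蓝色的', '草地是绿色的', '股市今天上涨了']
+ scores = encode(passages) @ encode([query])[0]   # cosine similarity per passage
+ for i in np.argsort(-scores):                    # highest score first
+     print(f'{scores[i]:.4f}  {passages[i]}')
+ ```
+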
+ ## Reference
+ - [bge fine-tuning code](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
+ - [C-MTEB official test script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)