value: 85.9
---
license: apache-2.0
---

# Model Introduction

360Zhinao-search uses our self-developed BERT model as its base and is fine-tuned on multiple tasks. It achieves an average score of 75.05 on the Retrieval tasks of the C-MTEB benchmark (C-MTEB-Retrieval), currently ranking first.

The [C-MTEB-Retrieval leaderboard](https://huggingface.co/spaces/mteb/leaderboard) contains 8 [query, passage] similarity retrieval subtasks covering different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain at 10) as the evaluation metric.

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|**360Zhinao-search** | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | **75.05** |
|AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
|OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
|piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
|stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |
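
For reference, NDCG@10 rewards systems that rank relevant passages near the top of the returned list. The snippet below is a minimal, illustrative computation of the metric for a single query; it is not the official C-MTEB evaluation code, and the `relevances` helper and the binary labels in the example are assumptions made for the illustration.

```python
import numpy as np

def ndcg_at_10(relevances):
    """NDCG@10 for one query.

    `relevances` are relevance labels of the returned passages, in the order
    the system ranked them. Illustrative only; assumes all relevant passages
    appear somewhere in the returned list when computing the ideal DCG.
    """
    rels = np.asarray(relevances, dtype=float)[:10]
    dcg = float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))

    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:10]
    idcg = float(np.sum(ideal / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant passage is returned at rank 3
print(ndcg_at_10([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
```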

## Optimization points
1. Data filtering: strictly prevent leakage of the C-MTEB-Retrieval test data by removing all test-set queries and passages from the training data;
2. Data source enhancement: use open-source data and LLM-synthesized data to improve data diversity;
3. Negative example mining: use multiple methods to mine hard-to-distinguish negative examples and improve the information gain (a minimal sketch follows this list);
4. Training efficiency: multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
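
The sketch below only illustrates the general idea behind point 3: score candidate passages with the embedding model and keep the highest-scoring ones that are not labeled positive. It assumes `query_emb` and `passage_embs` are L2-normalized embeddings like those produced in the Usage section; the actual training pipeline combines several mining strategies and is not published in this card.

```python
import numpy as np

def mine_hard_negatives(query_emb, passage_embs, positive_ids, k=10):
    """Pick the k highest-scoring passages that are not labeled positive.

    Illustrative helper only, not the actual mining pipeline.
    """
    positives = set(positive_ids)
    scores = passage_embs @ query_emb   # cosine similarity (embeddings are L2-normalized)
    ranked = np.argsort(-scores)        # highest-scoring passages first
    return [int(i) for i in ranked if i not in positives][:k]
```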

## Usage

```python
from typing import cast, List, Dict, Union

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":
    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
        # Take the [CLS] token embedding and L2-normalize it
        embeddings = last_hidden_state[:, 0]
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        embeddings = embeddings.cpu().numpy()

        print("embeddings:")
        print(embeddings)

        # With normalized embeddings, the dot product equals the cosine similarity
        cos_sim = np.dot(embeddings[0], embeddings[1])
        print("cos_sim:", cos_sim)
```
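
For retrieval-style usage, the same encoding can score one query against several candidate passages. The sketch below reuses `tokenizer`, `model`, `torch` and `numpy` from the snippet above; the `encode` helper and the example passages are illustrative additions, not part of the original card.

```python
def encode(texts):
    """Encode texts into L2-normalized [CLS] embeddings (same recipe as above)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        hidden = model(**batch, return_dict=True).last_hidden_state
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1).cpu().numpy()

query = '天空是什么颜色的'
passages = ['天空是蓝色的', '今天天气很好', '大海是蓝色的']  # made-up candidate passages
q_emb = encode([query])[0]
p_embs = encode(passages)
scores = p_embs @ q_emb          # cosine similarity per passage
for i in np.argsort(-scores):    # highest-scoring passages first
    print(round(float(scores[i]), 4), passages[i])
```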

## Reference
- [bge fine-tuning code](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
- [C-MTEB official test script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)