value: 85.9
---
license: apache-2.0
---

# Model Introduction

360Zhinao-search uses our self-developed BERT model as its base and is fine-tuned on multiple tasks. It achieves an average score of 75.05 on the Retrieval tasks of the C-MTEB benchmark (C-MTEB-Retrieval), currently ranking first.

The [C-MTEB-Retrieval leaderboard](https://huggingface.co/spaces/mteb/leaderboard) contains 8 [query, passage] similarity retrieval subtasks covering different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain at 10) as the evaluation metric.

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|**360Zhinao-search** | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | **75.05** |
|AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
|OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
|piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
|stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |
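
For reference, NDCG@10 rewards systems that rank relevant passages near the top of the returned list. The snippet below is a minimal, illustrative computation of the metric for a single query; it is not the official C-MTEB evaluation code, and the `relevances` helper and the binary labels in the example are assumptions made for the illustration.

```python
import numpy as np

def ndcg_at_10(relevances):
    """NDCG@10 for one query.

    `relevances` are relevance labels of the returned passages, in the order
    the system ranked them. Illustrative only; assumes all relevant passages
    appear somewhere in the returned list when computing the ideal DCG.
    """
    rels = np.asarray(relevances, dtype=float)[:10]
    dcg = float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))

    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:10]
    idcg = float(np.sum(ideal / np.log2(np.arange(2, ideal.size + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant passage is returned at rank 3
print(ndcg_at_10([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
```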

## Optimization points
1. Data filtering: strictly prevent leakage of the C-MTEB-Retrieval test data by removing all test-set queries and passages from the training data;
2. Data source enhancement: use open-source data and LLM-synthesized data to improve data diversity;
3. Negative example mining: use multiple methods to mine hard-to-distinguish negative examples and improve the information gain (a minimal sketch follows this list);
4. Training efficiency: multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
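
The sketch below only illustrates the general idea behind point 3: score candidate passages with the embedding model and keep the highest-scoring ones that are not labeled positive. It assumes `query_emb` and `passage_embs` are L2-normalized embeddings like those produced in the Usage section; the actual training pipeline combines several mining strategies and is not published in this card.

```python
import numpy as np

def mine_hard_negatives(query_emb, passage_embs, positive_ids, k=10):
    """Pick the k highest-scoring passages that are not labeled positive.

    Illustrative helper only, not the actual mining pipeline.
    """
    positives = set(positive_ids)
    scores = passage_embs @ query_emb   # cosine similarity (embeddings are L2-normalized)
    ranked = np.argsort(-scores)        # highest-scoring passages first
    return [int(i) for i in ranked if i not in positives][:k]
```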

## Usage

```python
from typing import cast, List, Dict, Union

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":
    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
        # Take the [CLS] token embedding and L2-normalize it
        embeddings = last_hidden_state[:, 0]
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        embeddings = embeddings.cpu().numpy()

        print("embeddings:")
        print(embeddings)

        # With normalized embeddings, the dot product equals the cosine similarity
        cos_sim = np.dot(embeddings[0], embeddings[1])
        print("cos_sim:", cos_sim)
```
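
For retrieval-style usage, the same encoding can score one query against several candidate passages. The sketch below reuses `tokenizer`, `model`, `torch` and `numpy` from the snippet above; the `encode` helper and the example passages are illustrative additions, not part of the original card.

```python
def encode(texts):
    """Encode texts into L2-normalized [CLS] embeddings (same recipe as above)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        hidden = model(**batch, return_dict=True).last_hidden_state
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1).cpu().numpy()

query = '天空是什么颜色的'
passages = ['天空是蓝色的', '今天天气很好', '大海是蓝色的']  # made-up candidate passages
q_emb = encode([query])[0]
p_embs = encode(passages)
scores = p_embs @ q_emb          # cosine similarity per passage
for i in np.argsort(-scores):    # highest-scoring passages first
    print(round(float(scores[i]), 4), passages[i])
```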

## Reference
- [bge fine-tuning code](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune)
- [C-MTEB official test script](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB)