Model Introduction

360Zhinao-search is built on a BERT model developed in-house and fine-tuned on multiple tasks. It achieves an average score of 75.05 on the C-MTEB-Retrieval benchmark, currently ranking first.

The C-MTEB-Retrieval leaderboard comprises 8 [query, passage] similarity retrieval subtasks from different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) as the evaluation metric.
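
To make the metric concrete, here is a minimal sketch of NDCG@k for a single query. It assumes the linear-gain formulation (gain equals the relevance label); the exact variant used by the official evaluation script is not stated here, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative rather than part of any library.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked results.

    Assumption: linear gain, i.e. each result contributes rel / log2(rank + 1)
    with ranks starting at 1.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float],
              k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: labels of the retrieved passages, in ranked order.
    all_relevances: labels of every judged passage for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passage exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant passage, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0], [1, 0, 0], k=10)
print(round(score, 4))
```

A benchmark score is then the mean of this per-query value over all queries in a subtask.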

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 360Zhinao-search | 87.12 | 83.32 | 87.57 | 85.02 | 46.73 | 68.9 | 63.69 | 78.09 | 75.05 |
| AGE_Hybrid | 86.88 | 80.65 | 89.28 | 83.66 | 47.26 | 69.28 | 65.94 | 76.79 | 74.97 |
| OpenSearch-text-hybrid | 86.76 | 79.93 | 87.85 | 84.03 | 46.56 | 68.79 | 65.92 | 75.43 | 74.41 |
| piccolo-large-zh-v2 | 86.14 | 79.54 | 89.14 | 86.78 | 47.58 | 67.75 | 64.88 | 73.1 | 74.36 |
| stella-large-zh-v3-1792d | 85.56 | 79.14 | 87.13 | 82.44 | 46.87 | 68.62 | 65.18 | 73.89 | 73.6 |

Optimization points

  1. Data filtering: Strictly prevent leakage of the C-MTEB-Retrieval test data by removing every query and passage that appears in the test set from the training data;
  2. Data source enhancement: Use open-source data and LLM-synthesized data to improve data diversity;
  3. Negative example mining: Use multiple methods to mine hard negatives (passages that are difficult to distinguish from the positives) to increase information gain;
  4. Training efficiency: Use multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
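
Points 1 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline used for this model: the normalization rule, the exact-match filter, the top-scoring-non-positive mining strategy, and the names `normalize_text`, `drop_leaked`, and `mine_hard_negatives` are all hypothetical.

```python
import numpy as np


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_leaked(train_pairs, test_texts):
    """Remove training (query, passage) pairs that overlap with the test set.

    Assumption: leakage is detected by exact match after normalization.
    """
    blocked = {normalize_text(t) for t in test_texts}
    return [
        (q, p) for q, p in train_pairs
        if normalize_text(q) not in blocked and normalize_text(p) not in blocked
    ]


def mine_hard_negatives(query_emb, passage_embs, positive_ids, num_negatives=2):
    """Return indices of the most similar passages that are not labeled positive.

    Assumption: embeddings are L2-normalized, so the dot product is the
    cosine similarity.
    """
    scores = passage_embs @ query_emb   # similarity of every passage to the query
    order = np.argsort(-scores)         # most similar first
    negatives = [int(i) for i in order if int(i) not in positive_ids]
    return negatives[:num_negatives]


# Toy data: the second training pair reuses a test passage and is dropped.
train = [("q1", "p1"), ("q2", "Test  Passage"), ("q3", "p3")]
print(drop_leaked(train, ["test passage"]))

# Toy 2-D embeddings: passage 0 is the labeled positive.
query = np.array([1.0, 0.0])
passages = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
print(mine_hard_negatives(query, passages, positive_ids={0}))
```

Negatives mined this way score high under the current model yet are not labeled relevant, which is what makes them informative during training; in practice some of them may be unlabeled positives and need further filtering.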

Usage

from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')
# Example query ("What color is the sky?") and passage ("The sky is blue.").
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":

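    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
        # Use the final hidden state of the first ([CLS]) token as the sentence embedding.
        embeddings = last_hidden_state[:, 0]
        # L2-normalize so that a dot product between embeddings is a cosine similarity.
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        embeddings = embeddings.cpu().numpy()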

    print("embeddings:")
    print(embeddings)

    cos_sim = np.dot(embeddings[0], embeddings[1])
    print("cos_sim:", cos_sim)
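
The same dot-product comparison extends to ranking many passages for one query. The sketch below assumes the embeddings were L2-normalized as in the usage code; `rank_passages` is a hypothetical helper, and toy 2-D vectors stand in for real model output so the snippet runs without downloading the model.

```python
import numpy as np


def rank_passages(query_emb: np.ndarray, passage_embs: np.ndarray, top_k: int = 3):
    """Return (index, score) pairs for the top_k passages by cosine similarity.

    Assumption: query_emb has shape (dim,), passage_embs has shape
    (num_passages, dim), and every vector is already L2-normalized.
    """
    scores = passage_embs @ query_emb    # one cosine similarity per passage
    top = np.argsort(-scores)[:top_k]    # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in top]


# Toy unit vectors standing in for embeddings produced by the model.
query = np.array([1.0, 0.0])
passages = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
print(rank_passages(query, passages, top_k=2))
```

For large passage collections, an approximate nearest-neighbor index would typically replace the full matrix product.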

Reference

bge fine-tuning code

C-MTEB official test script

License

The source code of this repository follows the open-source license Apache 2.0.

360Zhinao open-source models support commercial use. If you wish to use these models or continue training them for commercial purposes, please contact us via email ([email protected]) to apply. For the specific license agreement, please see the "360 Zhinao Open-Source Model License".
