Model Introduction

360Zhinao-search is built by multi-task fine-tuning a self-developed BERT model. It achieves an average score of 75.05 on the C-MTEB-Retrieval benchmark, which ranked first on the leaderboard at the time of release.

The C-MTEB-Retrieval leaderboard comprises 8 [query, passage] similarity retrieval subtasks from different domains and uses NDCG@10 (Normalized Discounted Cumulative Gain @ 10) as the evaluation metric.

Model	T2Retrieval	MMarcoRetrieval	DuRetrieval	CovidRetrieval	CmedqaRetrieval	EcomRetrieval	MedicalRetrieval	VideoRetrieval	Avg
360Zhinao-search	87.12	83.32	87.57	85.02	46.73	68.9	63.69	78.09	75.05
AGE_Hybrid	86.88	80.65	89.28	83.66	47.26	69.28	65.94	76.79	74.97
OpenSearch-text-hybrid	86.76	79.93	87.85	84.03	46.56	68.79	65.92	75.43	74.41
piccolo-large-zh-v2	86.14	79.54	89.14	86.78	47.58	67.75	64.88	73.1	74.36
stella-large-zh-v3-1792d	85.56	79.14	87.13	82.44	46.87	68.62	65.18	73.89	73.6
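
The NDCG@10 metric behind the scores above can be sketched as follows. This is a minimal illustration, not the official C-MTEB evaluation code: the function names are hypothetical, it assumes linear gain (some implementations use 2^rel - 1 instead), and it computes the score for a single query. The table values appear to be per-query scores averaged and expressed as percentages.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved passages, in ranked order.
    judged_relevances: relevance labels of all passages judged relevant for the
        query, used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passages: the score is defined as 0 here
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Two relevant passages exist; the system ranks them at positions 1 and 3.
    print(ndcg_at_k([1, 0, 1], [1, 1]))
```

A perfect ranking yields 1.0, and pushing relevant passages further down the list lowers the score.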

Optimization points

Data filtering: Strictly prevent leakage of the C-MTEB-Retrieval test data by removing all test-set queries and passages from the training data;
Data source enhancement: Use open-source data and LLM-synthesized data to improve data diversity;
Negative example mining: Use multiple methods to mine hard negatives (negative examples that are difficult to distinguish from positives) to improve information gain;
Training efficiency: Use multi-machine, multi-GPU training with DeepSpeed to optimize GPU memory utilization.
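
The data filtering and negative example mining points can be illustrated with a small sketch. The concrete methods used for 360Zhinao-search are not described here, so everything below is an assumption: exact matching after whitespace and case normalization stands in for the leak filter, and "highest-scoring passages that are not labeled positive" stands in for one possible hard-negative strategy. The function names are hypothetical.

```python
import numpy as np


def _normalize_text(text: str) -> str:
    """Strip all whitespace and lowercase, so trivial variants still match."""
    return "".join(text.split()).lower()


def drop_leaked_examples(train_pairs, test_texts):
    """Remove (query, passage) training pairs that overlap with the test set.

    Assumption: a pair counts as leaked if its query or passage exactly
    matches any test-set text after normalization.
    """
    blocked = {_normalize_text(t) for t in test_texts}
    return [
        (q, p)
        for q, p in train_pairs
        if _normalize_text(q) not in blocked and _normalize_text(p) not in blocked
    ]


def mine_hard_negatives(query_emb, passage_embs, positive_ids, num_negatives=2):
    """Return indices of the top-scoring passages that are not labeled positive.

    Assumption: embeddings are L2-normalized, so the dot product is the
    cosine similarity. Passages that score high but are not positives are
    the ones the current model finds hardest to tell apart.
    """
    scores = passage_embs @ query_emb
    ranked = np.argsort(-scores)  # best match first
    negatives = [int(i) for i in ranked if int(i) not in positive_ids]
    return negatives[:num_negatives]
```

In practice, mined negatives are often filtered further, because some high-scoring unlabeled passages are actually relevant (false negatives).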

Usage

from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('qihoo360/360Zhinao-search')
model = AutoModel.from_pretrained('qihoo360/360Zhinao-search')

# Example pair: "What color is the sky" / "The sky is blue".
sentences = ['天空是什么颜色的', '天空是蓝色的']
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)

if __name__ == "__main__":

    with torch.no_grad():
        last_hidden_state = model(**inputs, return_dict=True).last_hidden_state
        # Use the hidden state of the first token ([CLS]) as the sentence embedding.
        embeddings = last_hidden_state[:, 0]
        # L2-normalize so that the dot product below equals cosine similarity.
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        embeddings = embeddings.cpu().numpy()

    print("embeddings:")
    print(embeddings)

    cos_sim = np.dot(embeddings[0], embeddings[1])
    print("cos_sim:", cos_sim)
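
For retrieval, the same similarity is usually computed between one query and many passages, followed by a sort. The sketch below assumes the embeddings have already been produced and L2-normalized as in the snippet above; `rank_passages` is a hypothetical helper, and the toy vectors stand in for real model outputs.

```python
import numpy as np


def rank_passages(query_emb, passage_embs, top_k=3):
    """Rank passages by cosine similarity to the query.

    Assumes all embeddings are L2-normalized, so a dot product is enough.
    Returns a list of (passage_index, score) tuples, best match first.
    """
    scores = passage_embs @ query_emb
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


if __name__ == "__main__":
    # Toy 2-dimensional unit vectors used in place of real embeddings.
    query = np.array([1.0, 0.0])
    passages = np.array([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]])
    for idx, score in rank_passages(query, passages, top_k=2):
        print(idx, round(score, 3))
```

For large corpora, an approximate nearest-neighbor index is typically used instead of a full matrix product.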

Reference

bge fine-tuning code

C-MTEB official test script

License

The source code of this repository follows the open-source license Apache 2.0.

360Zhinao open-source models support commercial use. To use these models or continue training them for commercial purposes, please contact us by email ([email protected]) to apply. For the specific license terms, see the 360 Zhinao Open-Source Model License.
