---
language:
- zh
- en
base_model: openbmb/MiniCPM-2B-sft-bf16
---
## RankCPM-E

**RankCPM-E** is a bilingual and cross-lingual (Chinese and English) text embedding model developed by ModelBest Inc. and THUNLP (the Natural Language Processing Lab at Tsinghua University). Its main features are:

- Strong retrieval performance in both Chinese and English.
- Strong cross-lingual retrieval between Chinese and English.

RankCPM-E is trained from [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and uses bidirectional attention and Weighted Mean Pooling [1] in its architecture. It was trained in multiple stages on approximately 6 million examples drawn from open-source, synthetic, and proprietary data.

The RAG toolkit series consists of:

- Retrieval Model: [RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
- Re-ranking Model: [RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
- LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)

[1] Muennighoff, N. (2022). SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
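
As a rough illustration of the Weighted Mean Pooling [1] mentioned above, the sketch below pools a single sequence in plain Python. It assumes the same weighting as the demo script in the Usage section: the i-th real token gets weight i and padding gets weight 0. The toy vectors and mask are made up for the example, and this is not the code the model actually runs.

```python
def weighted_mean_pooling(hidden, attention_mask):
    """Position-weighted mean over one sequence (illustrative sketch).

    hidden: list of token vectors (list of lists of floats)
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding
    """
    # Running count of real tokens: the i-th real token gets weight i,
    # padding positions get weight 0.
    weights = []
    seen = 0
    for flag in attention_mask:
        seen += flag
        weights.append(flag * seen)

    total = sum(weights)
    dim = len(hidden[0])
    # Weighted average of each embedding dimension.
    return [
        sum(w * vec[k] for w, vec in zip(weights, hidden)) / total
        for k in range(dim)
    ]


# Toy input: three real tokens followed by one padding position.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = weighted_mean_pooling(hidden, mask)
print(pooled)  # weights are 1/6, 2/6, 3/6, 0, so roughly [0.667, 0.833]
```

Later tokens contribute more than earlier ones, and the padded position is ignored entirely.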

## Model Information

- Model Size: 2.4B
- Embedding Dimension: 2304
- Max Input Tokens: 512

## Usage

### Input Format

RankCPM-E supports query-side instructions in the following format:

```
Instruction: {{ instruction }} Query: {{ query }}
```

For example:

```
Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
```

```
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.
```

RankCPM-E also works without an instruction, in which case the query is formatted as follows:

```
Query: {{ query }}
```

For evaluation on BEIR and C-MTEB/Retrieval, we use the instructions in `instructions.json`; other evaluations use no instructions. On the document side, the raw document text is used directly as input.
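
The two query formats above can be produced with a small helper. This is a minimal sketch: `format_query` is a hypothetical name, not part of the model's code, and the single space between the instruction and `Query:` follows the template shown above.

```python
def format_query(query, instruction=None):
    """Build the query-side input string (hypothetical helper)."""
    if instruction:
        # Instruction mode: "Instruction: ... Query: ..."
        return f"Instruction: {instruction} Query: {query}"
    # Instruction-free mode: "Query: ..."
    return f"Query: {query}"


plain = format_query("中国的首都是哪里?")
with_instruction = format_query(
    "However the warming trend is slower than most climate models have forecast.",
    instruction=(
        "Given a claim about climate change, retrieve documents "
        "that support or refute the claim."
    ),
)
print(plain)
print(with_instruction)
```

Documents are passed to the encoder unchanged, so no corresponding helper is needed on the document side.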

### Requirements

```
transformers==4.37.2
flash-attn>2.3.5
```

### Demo

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "openbmb/RankCPM-E"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code=True loads the custom modeling code from the model repository.
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

def weighted_mean_pooling(hidden, attention_mask):
    # Weight the i-th real token by i (cumulative count of the mask); padding gets weight 0.
    attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
    s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
    d = attention_mask_.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps

@torch.no_grad()
def encode(input_texts):
    # Tokenize with truncation to the model's 512-token input limit.
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to("cuda")

    outputs = model(**batch_dict)
    attention_mask = batch_dict["attention_mask"]
    hidden = outputs.last_hidden_state

    reps = weighted_mean_pooling(hidden, attention_mask)
    # L2-normalize so that the dot product below is the cosine similarity.
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings

queries = ["中国的首都是哪里?"]
passages = ["beijing", "shanghai"]

# Instruction-free mode: only the "Query: " prefix is added on the query side.
INSTRUCTION = "Query: "
queries = [INSTRUCTION + query for query in queries]

embeddings_query = encode(queries)
embeddings_doc = encode(passages)  # documents are encoded without any prefix

# Rows correspond to queries, columns to passages.
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.3535913825035095, 0.18596848845481873]]
```
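
To turn a row of similarity scores into a ranking, the passages can be sorted by score. The sketch below reuses the scores printed by the demo; `rank_passages` is a hypothetical helper name introduced only for this example.

```python
def rank_passages(scores_row, passages):
    """Return (passage, score) pairs sorted from most to least similar."""
    return sorted(zip(passages, scores_row), key=lambda pair: pair[1], reverse=True)


# Scores copied from the demo output above (one query, two passages).
scores = [[0.3535913825035095, 0.18596848845481873]]
passages = ["beijing", "shanghai"]

ranked = rank_passages(scores[0], passages)
for passage, score in ranked:
    print(f"{score:.4f}  {passage}")
```

For the Chinese query in the demo, the English passage "beijing" ranks first, which is a small instance of the cross-lingual retrieval the model targets.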

## Evaluation Results

### Chinese and English Retrieval Results

| Model                        | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|------------------------------|-------------------|---------------|
| bge-large-zh-v1.5            | 70.46             | -             |
| gte-large-zh                 | 72.49             | -             |
| Zhihui_LLM_Embedding         | 76.74             | -             |
| bge-large-en-v1.5            | -                 | 54.29         |
| gte-en-large-v1.5            | -                 | 57.91         |
| NV-Retriever-v1              | -                 | 60.9          |
| bge-en-icl                   | -                 | 62.16         |
| NV-Embed-v2                  | -                 | 62.65         |
| me5-large                    | 63.66             | 51.43         |
| bge-m3(Dense)                | 65.43             | 48.82         |
| gte-multilingual-base(Dense) | 71.95             | 51.08         |
| gte-Qwen2-1.5B-instruct      | 71.86             | 58.29         |
| gte-Qwen2-7B-instruct        | 76.03             | 60.25         |
| bge-multilingual-gemma2      | 73.73             | 59.24         |
| RankCPM-E                    | **76.76**         | 58.56         |
| RankCPM-E+RankCPM-R         | 77.08             | 61.61         |

### Chinese-English Cross-lingual Retrieval Results

| Model                        | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|------------------------------|--------------------|--------------------|--------------------|
| me5-large                    | 44.3               | 9.01               | 25.33              |
| bge-m3(Dense)                | 66.4               | 30.49              | 41.09              |
| gte-multilingual-base(Dense) | 68.2               | 39.46              | 45.86              |
| gte-Qwen2-1.5B-instruct      | 68.52              | 49.11              | 45.05              |
| gte-Qwen2-7B-instruct        | 68.27              | 49.14              | 49.6               |
| RankCPM-E                    | **72.95**          | **52.65**          | **49.95**          |
| RankCPM-E+RankCPM-R         | 74.33              | 53.21              | 54.12              |

## License

- The code in this repository is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
- Use of the RankCPM-E model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
- The RankCPM-E model weights are completely free for academic research. After you fill out the [registration questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g), the weights are also available for free commercial use.