Kaguya-19 committed
Commit 6b6c466
Parent: 0d1b94a

Update README.md

Files changed (1):
  1. README.md +20 -20
README.md CHANGED
@@ -4,31 +4,31 @@ language:
 - en
 base_model: openbmb/MiniCPM-2B-sft-bf16
 ---
-## RankCPM-E
+## MiniCPM-Embedding
 
-**RankCPM-E** 是面壁智能与清华大学自然语言处理实验室(THUNLP)共同开发的中英双语言文本嵌入模型,有如下特点:
+**MiniCPM-Embedding** 是面壁智能与清华大学自然语言处理实验室(THUNLP)共同开发的中英双语言文本嵌入模型,有如下特点:
 - 出色的中文、英文检索能力。
 - 出色的中英跨语言检索能力。
 
-RankCPM-E 基于 [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) 训练,结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 600 万条训练数据。
+MiniCPM-Embedding 基于 [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) 训练,结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 600 万条训练数据。
 
 欢迎关注 RAG 套件系列:
 
-- 检索模型:[RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
-- 重排模型:[RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
+- 检索模型:[MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding)
+- 重排模型:[MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker)
 - 面向 RAG 场景的 LoRA 插件:[MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
 
-**RankCPM-E** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. and THUNLP, featuring:
+**MiniCPM-Embedding** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. and THUNLP, featuring:
 
 - Exceptional Chinese and English retrieval capabilities.
 - Outstanding cross-lingual retrieval capabilities between Chinese and English.
 
-RankCPM-E is trained based on [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 6 million training examples, including open-source, synthetic, and proprietary data.
+MiniCPM-Embedding is trained based on [MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) and incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 6 million training examples, including open-source, synthetic, and proprietary data.
 
 We also invite you to explore the RAG toolkit series:
 
-- Retrieval Model: [RankCPM-E](https://huggingface.co/openbmb/RankCPM-E)
-- Re-ranking Model: [RankCPM-R](https://huggingface.co/openbmb/RankCPM-R)
+- Retrieval Model: [MiniCPM-Embedding](https://huggingface.co/openbmb/MiniCPM-Embedding)
+- Re-ranking Model: [MiniCPM-Reranker](https://huggingface.co/openbmb/MiniCPM-Reranker)
 - LoRA Plugin for RAG scenarios: [MiniCPM3-RAG-LoRA](https://huggingface.co/openbmb/MiniCPM3-RAG-LoRA)
 
 [1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
@@ -49,7 +49,7 @@ We also invite you to explore the RAG toolkit series:
 
 本模型支持 query 侧指令,格式如下:
 
-RankCPM-E supports query-side instructions in the following format:
+MiniCPM-Embedding supports query-side instructions in the following format:
 
 ```
 Instruction: {{ instruction }} Query: {{ query }}
@@ -69,7 +69,7 @@ Instruction: Given a claim about climate change, retrieve documents that support
 
 也可以不提供指令,即采取如下格式:
 
-RankCPM-E also works in instruction-free mode in the following format:
+MiniCPM-Embedding also works in instruction-free mode in the following format:
 
 ```
 Query: {{ query }}
@@ -94,7 +94,7 @@ from transformers import AutoModel, AutoTokenizer
 import torch
 import torch.nn.functional as F
 
-model_name = "openbmb/RankCPM-E"
+model_name = "openbmb/MiniCPM-Embedding"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
 model.eval()
@@ -152,8 +152,8 @@ print(scores.tolist()) # [[0.3535913825035095, 0.18596848845481873]]
 | gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
 | gte-Qwen2-7B-instruct | 76.03 | 60.25 |
 | bge-multilingual-gemma2 | 73.73 | 59.24 |
-| RankCPM-E | **76.76** | 58.56 |
-| RankCPM-E+RankCPM-R | 77.08 | 61.61 |
+| MiniCPM-Embedding | **76.76** | 58.56 |
+| MiniCPM-Embedding+MiniCPM-Reranker | 77.08 | 61.61 |
 
 ### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
 
@@ -164,15 +164,15 @@ print(scores.tolist()) # [[0.3535913825035095, 0.18596848845481873]]
 | gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
 | gte-Qwen2-1.5B-instruct | 68.52 | 49.11 | 45.05 |
 | gte-Qwen2-7B-instruct | 68.27 | 49.14 | 49.6 |
-| RankCPM-E | **72.95** | **52.65** | **49.95** |
-| RankCPM-E+RankCPM-R | 74.33 | 53.21 | 54.12 |
+| MiniCPM-Embedding | **72.95** | **52.65** | **49.95** |
+| MiniCPM-Embedding+MiniCPM-Reranker | 74.33 | 53.21 | 54.12 |
 
 ## 许可证 License
 
 - 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
-- RankCPM-E 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
-- RankCPM-E 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
+- MiniCPM-Embedding 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
+- MiniCPM-Embedding 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
 
 * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
-* The usage of RankCPM-E model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
-* The models and weights of RankCPM-E are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, RankCPM-E weights are also available for free commercial use.
+* The usage of MiniCPM-Embedding model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+* The models and weights of MiniCPM-Embedding are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-Embedding weights are also available for free commercial use.
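The README's two query-side templates (`Instruction: {{ instruction }} Query: {{ query }}` and the instruction-free `Query: {{ query }}`) can be sketched as a small helper. This is an illustrative sketch, not part of the model card: the function name `format_query` and the sample instruction/query strings are invented for the example; only the template whitespace follows the diff above.

```python
def format_query(query: str, instruction: str = "") -> str:
    """Render a query-side input string for MiniCPM-Embedding.

    With an instruction the template is
    "Instruction: {instruction} Query: {query}"; with an empty
    instruction it falls back to instruction-free "Query: {query}".
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


# Hypothetical instructed query (instruction text invented for illustration):
print(format_query(
    "how does retrieval-augmented generation work",
    "Given a web search query, retrieve relevant passages.",
))
# Instruction-free mode:
print(format_query("how does retrieval-augmented generation work"))
```

The resulting strings would then be tokenized and encoded exactly as in the README's usage snippet; passage-side inputs take no instruction prefix under these templates.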