renillhuang committed
Commit dc9ef78 • Parent(s): 3e59fda
readme: Update tables with background color and bold best score

Files changed:
- README.md (+48, -47)
- README_zh.md (+46, -48)
README.md
CHANGED
## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks

### 3.1.1. LLM evaluation results on examination and professional knowledge

|TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion MOE8x7B|
| -------------- | ---- | ---- | ---- | ---- | ---- |
| MMLU | 70.4 | 73.4 | 82.9 | 69.9 | <span style="background-color: #add8e6;">**85.9**</span> |
| MMLU Pro | 38.5 | 45.3 | 58.0 | 34.0 | <span style="background-color: #add8e6;">**58.3**</span> |
| CEval | 54.1 | 83.5 | 87.7 | 72.8 | <span style="background-color: #add8e6;">**89.7**</span> |
| CMMLU | 53.2 | 82.3 | 89.0 | 70.6 | <span style="background-color: #add8e6;">**89.2**</span> |
| ARC_c | 85.1 | 90.2 | **94.2** | 79.7 | <span style="background-color: #add8e6;">91.9</span> |
| HellaSwag | 81.9 | 82.0 | 82.5 | 78.5 | <span style="background-color: #add8e6;">**89.2**</span> |
| LAMBADA | 76.8 | 73.7 | 75.4 | 78.8 | <span style="background-color: #add8e6;">**79.7**</span> |
| BBH | 50.9 | 57.3 | **67.7** | 50.4 | <span style="background-color: #add8e6;">55.8</span> |
| MuSR | 43.2 | 42.7 | 49.8 | 43.6 | <span style="background-color: #add8e6;">**49.9**</span> |
| PIQA | 83.4 | 82.2 | 80.1 | 79.5 | <span style="background-color: #add8e6;">**87.3**</span> |
| CommonSenseQA | 69.6 | **74.7** | 73.0 | 66.9 | <span style="background-color: #add8e6;">73.1</span> |
| IFEval | 24.2 | 33.0 | **41.6** | 29.1 | <span style="background-color: #add8e6;">30.1</span> |
| GQPA | 30.9 | 33.5 | 49.5 | 28.5 | <span style="background-color: #add8e6;">**52.2**</span> |
| HumanEval | 33.5 | 36.0 | **47.0** | 20.1 | <span style="background-color: #add8e6;">44.5</span> |

### 3.1.2. Comparison of LLM performances on Japanese testsets

|Model |Average|JSQuAD|JCommonSenseQA|JNLI|MARC-ja|JAQKET v2|PAWS-ja|
|-------------|-------|-------|---------------|-----|-------|---------|-------|
|Mixtral-8x7B |<span style="background-color: #ffffe0;">69.8</span> |89.0 |78.7 |32.1 |95.4 |78.9 |44.5 |
|Qwen1.5-32B |<span style="background-color: #ffffe0;">74.7</span> |89.9 |84.5 |51.0 |97.1 |82.1 |43.8 |
|Qwen2.5-32B |<span style="background-color: #ffffe0;">80.7</span> |89.1 |93.8 |72.1 |**97.9** |**89.3** |42.2 |
|Orion-14B |<span style="background-color: #ffffe0;">74.2</span> |74.2 |88.2 |72.8 |94.1 |66.2 |49.9 |
|Orion-MOE8x7B|<span style="background-color: #ffffe0;">**82.9**</span> |<span style="background-color: #add8e6;">**91.8**</span> |<span style="background-color: #add8e6;">90.4</span> |<span style="background-color: #add8e6;">**90.5**</span> |<span style="background-color: #add8e6;">96.4</span> |<span style="background-color: #add8e6;">81.2</span> |<span style="background-color: #add8e6;">**47.4**</span> |

### 3.1.3. Comparison of LLM performances on Korean testsets

|Model|Average|HAE-RAE|KoBEST BoolQ|KoBEST COPA|KoBEST HellaSwag|KoBEST SentiNeg|KoBEST WiC|PAWS-ko|
|-----|-------|-------|------------|-----------|----------------|---------------|----------|-------|
|Mixtral-8x7B |<span style="background-color: #ffffe0;">60.7</span> |53.2 |78.6 |66.2 |56.6 |77.1 |49.4 |44.1 |
|Qwen1.5-32B |<span style="background-color: #ffffe0;">58.6</span> |46.4 |76.3 |60.4 |53.0 |78.3 |52.1 |43.4 |
|Qwen2.5-32B |<span style="background-color: #ffffe0;">71.4</span> |**70.7** |80.3 |76.7 |**61.2** |96.5 |**77.2** |37.1 |
|Orion-14B |<span style="background-color: #ffffe0;">67.7</span> |69.7 |80.6 |77.1 |58.2 |92.4 |51.2 |44.6 |
|Orion-MOE8x7B|<span style="background-color: #ffffe0;">**72.0**</span> |<span style="background-color: #add8e6;">65.2</span> |<span style="background-color: #add8e6;">**85.4**</span> |<span style="background-color: #add8e6;">**80.4**</span> |<span style="background-color: #add8e6;">56.0</span> |<span style="background-color: #add8e6;">**97.0**</span> |<span style="background-color: #add8e6;">73.6</span> |<span style="background-color: #add8e6;">**46.4**</span> |

### 3.1.4. Comparison of LLM performances on Arabic, German, French, and Spanish testsets

| Language | Spanish | | French | | German | | Arabic | |
|----|----|----|----|----|----|----|----|----|
|**Model**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
|Mixtral-8x7B |74.3 |54.8 |73.9 |55.9 |69.2 |52.4 |47.9 |36.3 |
|Qwen1.5-32B |70.5 |55.1 |68.9 |56.0 |63.8 |50.8 |50.1 |40.0 |
|Qwen2.5-32B |75.0 |65.3 |74.2 |62.7 |69.8 |61.8 |59.8 |52.9 |
|Orion-14B |62.0 |44.6 |60.2 |42.3 |54.7 |38.9 |42.3 |33.9 |
|Orion-MOE8x7B|<span style="background-color: #add8e6;">**87.4**</span> |<span style="background-color: #add8e6;">**70.1**</span> |<span style="background-color: #add8e6;">**85.6**</span> |<span style="background-color: #add8e6;">**68.8**</span> |<span style="background-color: #add8e6;">**80.6**</span> |<span style="background-color: #add8e6;">**63.5**</span> |<span style="background-color: #add8e6;">**69.4**</span> |<span style="background-color: #add8e6;">**54.3**</span> |

### 3.1.5. Leakage Detection Benchmark

When the pre-training data of a large language model contains content from a specific dataset, the model's performance on that dataset may be artificially inflated, leading to inaccurate performance evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions have proposed a simple and effective method for detecting data leakage. The method leverages the interchangeable nature of multiple-choice options: the options of each item in the original dataset are shuffled to generate a derived dataset, and the log-probability distribution the model assigns to the derived data is then used to detect whether the original dataset appeared in the training data.

We conducted data leakage detection experiments on three benchmark datasets: MMLU, CEval, and CMMLU.

More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
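
As a rough illustration of the shuffling idea (this is not the authors' test code — see the repository linked above), the sketch below scores each permutation of a multiple-choice item's options with a causal LM from `transformers` and checks whether the original ordering stands out. The checkpoint name, prompt format, and decision rule are placeholder assumptions for illustration only.

```python
# Minimal sketch of option-shuffle leakage detection (not the official test code).
# Idea: if a model has memorized a benchmark item, the original option ordering
# tends to receive a noticeably higher log-probability than shuffled variants.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "OrionStarAI/Orion-MOE8x7B-Base"  # placeholder; any causal LM checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of the token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    targets = ids[:, 1:]
    return logprobs.gather(2, targets.unsqueeze(-1)).sum().item()

def original_is_top(question: str, options: list[str]) -> bool:
    """Score every permutation of the options and report whether the original
    ordering receives the highest log-probability (a possible memorization signal)."""
    def render(opts):
        lines = [f"{label}. {opt}" for label, opt in zip("ABCD", opts)]
        return question + "\n" + "\n".join(lines)

    scores = {perm: sequence_logprob(render(perm))
              for perm in itertools.permutations(options)}
    return max(scores, key=scores.get) == tuple(options)

# Aggregating this per-item signal over a whole benchmark and comparing it with
# the chance level is, in spirit, what the leakage scores below summarize.
```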
|Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion MOE8x7B|Orion 14B|Mixtral 8x7B|
|------|------|------|------|------|------|
|MMLU | 0.30 | 0.27 | <span style="background-color: #add8e6;">**0.22**</span> | 0.28 | 0.25 |
|CEval | 0.39 | 0.38 | <span style="background-color: #add8e6;">0.27</span> | **0.26** | **0.26** |
|CMMLU | 0.38 | 0.39 | <span style="background-color: #add8e6;">0.23</span> | 0.27 | **0.22** |

### 3.1.6. Inference speed

We set up inference servers on 8x Nvidia RTX 3090 GPUs and 4x Nvidia A100 GPUs and measured throughput from the client side in tokens per second.

|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
|---------|--------|-------|--------|-------|
|OrionMOE | <span style="background-color: #add8e6;">**102.77**</span> | <span style="background-color: #add8e6;">**54.61**</span> | <span style="background-color: #add8e6;">**107.76**</span> | <span style="background-color: #add8e6;">**61.83**</span> |
|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |

<br>
We also compared inference speed on 4x A100 for different input lengths (tokens), again measuring tokens per second from the client.

| Input | 4k | 8k | 12k | 16k | 32k | 64k |
|---------|-------|-------|-------|-------|-------|-------|
|OrionMOE | <span style="background-color: #add8e6;">**90.86**</span> | <span style="background-color: #add8e6;">**54.40**</span> | <span style="background-color: #add8e6;">**31.08**</span> | <span style="background-color: #add8e6;">**29.04**</span> | <span style="background-color: #add8e6;">**22.69**</span> | <span style="background-color: #add8e6;">**14.51**</span> |
|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
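
For reference, a client-side tokens-per-second figure like the ones above can be collected with a small script. The sketch below assumes the model is served behind an OpenAI-compatible completions endpoint (the URL, model name, and prompt are placeholders, not part of this release) and averages generated tokens per wall-clock second over a few requests.

```python
# Rough client-side throughput probe against an OpenAI-compatible completions API.
# Endpoint URL, model name, and prompt are placeholders for whatever server you run.
import time
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed serving endpoint
PAYLOAD = {
    "model": "orion-moe8x7b",  # placeholder model name on the server
    "prompt": "Write a short paragraph about mixture-of-experts models.",
    "max_tokens": 256,
    "temperature": 0.0,
}

def tokens_per_second(n_requests: int = 5) -> float:
    """Average generated tokens per second over a few sequential requests."""
    total_tokens, total_seconds = 0, 0.0
    for _ in range(n_requests):
        start = time.perf_counter()
        resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=600)
        resp.raise_for_status()
        total_seconds += time.perf_counter() - start
        # OpenAI-compatible servers report generated-token counts in `usage`.
        total_tokens += resp.json()["usage"]["completion_tokens"]
    return total_tokens / total_seconds

if __name__ == "__main__":
    print(f"~{tokens_per_second():.2f} tokens/s (single client, 1 concurrent)")
```

The multi-concurrency columns would correspond, roughly, to running several such clients in parallel and aggregating their generated tokens over the same wall-clock window.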
<a name="model-inference"></a><br>
# 4. Model Inference
README_zh.md
CHANGED
## 3.1. Base Model Orion-MOE8x7B-Base Evaluation

### 3.1.1. Base model benchmark comparison

|TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion MOE8x7B|
| -------------- | ---- | ---- | ---- | ---- | ---- |
| MMLU | 70.4 | 73.4 | 82.9 | 69.9 | <span style="background-color: #add8e6;">**85.9**</span> |
| MMLU Pro | 38.5 | 45.3 | 58.0 | 34.0 | <span style="background-color: #add8e6;">**58.3**</span> |
| CEval | 54.1 | 83.5 | 87.7 | 72.8 | <span style="background-color: #add8e6;">**89.7**</span> |
| CMMLU | 53.2 | 82.3 | 89.0 | 70.6 | <span style="background-color: #add8e6;">**89.2**</span> |
| ARC_c | 85.1 | 90.2 | **94.2** | 79.7 | <span style="background-color: #add8e6;">91.9</span> |
| HellaSwag | 81.9 | 82.0 | 82.5 | 78.5 | <span style="background-color: #add8e6;">**89.2**</span> |
| LAMBADA | 76.8 | 73.7 | 75.4 | 78.8 | <span style="background-color: #add8e6;">**79.7**</span> |
| BBH | 50.9 | 57.3 | **67.7** | 50.4 | <span style="background-color: #add8e6;">55.8</span> |
| MuSR | 43.2 | 42.7 | 49.8 | 43.6 | <span style="background-color: #add8e6;">**49.9**</span> |
| PIQA | 83.4 | 82.2 | 80.1 | 79.5 | <span style="background-color: #add8e6;">**87.3**</span> |
| CommonSenseQA | 69.6 | **74.7** | 73.0 | 66.9 | <span style="background-color: #add8e6;">73.1</span> |
| IFEval | 24.2 | 33.0 | **41.6** | 29.1 | <span style="background-color: #add8e6;">30.1</span> |
| GQPA | 30.9 | 33.5 | 49.5 | 28.5 | <span style="background-color: #add8e6;">**52.2**</span> |
| HumanEval | 33.5 | 36.0 | **47.0** | 20.1 | <span style="background-color: #add8e6;">44.5</span> |

### 3.1.2. Minor languages: Japanese

|Model |Average|JSQuAD|JCommonSenseQA|JNLI|MARC-ja|JAQKET v2|PAWS-ja|
|-------------|-------|-------|---------------|-----|-------|---------|-------|
|Mixtral-8x7B |<span style="background-color: #ffffe0;">69.8</span> |89.0 |78.7 |32.1 |95.4 |78.9 |44.5 |
|Qwen1.5-32B |<span style="background-color: #ffffe0;">74.7</span> |89.9 |84.5 |51.0 |97.1 |82.1 |43.8 |
|Qwen2.5-32B |<span style="background-color: #ffffe0;">80.7</span> |89.1 |93.8 |72.1 |**97.9** |**89.3** |42.2 |
|Orion-14B |<span style="background-color: #ffffe0;">74.2</span> |74.2 |88.2 |72.8 |94.1 |66.2 |49.9 |
|Orion-MOE8x7B|<span style="background-color: #ffffe0;">**82.9**</span> |<span style="background-color: #add8e6;">**91.8**</span> |<span style="background-color: #add8e6;">90.4</span> |<span style="background-color: #add8e6;">**90.5**</span> |<span style="background-color: #add8e6;">96.4</span> |<span style="background-color: #add8e6;">81.2</span> |<span style="background-color: #add8e6;">**47.4**</span> |

### 3.1.3. Minor languages: Korean

|Model|Average|HAE-RAE|KoBEST BoolQ|KoBEST COPA|KoBEST HellaSwag|KoBEST SentiNeg|KoBEST WiC|PAWS-ko|
|-----|-------|-------|------------|-----------|----------------|---------------|----------|-------|
|Mixtral-8x7B |<span style="background-color: #ffffe0;">60.7</span> |53.2 |78.6 |66.2 |56.6 |77.1 |49.4 |44.1 |
|Qwen1.5-32B |<span style="background-color: #ffffe0;">58.6</span> |46.4 |76.3 |60.4 |53.0 |78.3 |52.1 |43.4 |
|Qwen2.5-32B |<span style="background-color: #ffffe0;">71.4</span> |**70.7** |80.3 |76.7 |**61.2** |96.5 |**77.2** |37.1 |
|Orion-14B |<span style="background-color: #ffffe0;">67.7</span> |69.7 |80.6 |77.1 |58.2 |92.4 |51.2 |44.6 |
|Orion-MOE8x7B|<span style="background-color: #ffffe0;">**72.0**</span> |<span style="background-color: #add8e6;">65.2</span> |<span style="background-color: #add8e6;">**85.4**</span> |<span style="background-color: #add8e6;">**80.4**</span> |<span style="background-color: #add8e6;">56.0</span> |<span style="background-color: #add8e6;">**97.0**</span> |<span style="background-color: #add8e6;">73.6</span> |<span style="background-color: #add8e6;">**46.4**</span> |

### 3.1.4. Minor languages: Arabic, German, French, and Spanish

| Language | Spanish | | French | | German | | Arabic | |
|----|----|----|----|----|----|----|----|----|
|**Model**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
|Mixtral-8x7B |74.3 |54.8 |73.9 |55.9 |69.2 |52.4 |47.9 |36.3 |
|Qwen1.5-32B |70.5 |55.1 |68.9 |56.0 |63.8 |50.8 |50.1 |40.0 |
|Qwen2.5-32B |75.0 |65.3 |74.2 |62.7 |69.8 |61.8 |59.8 |52.9 |
|Orion-14B |62.0 |44.6 |60.2 |42.3 |54.7 |38.9 |42.3 |33.9 |
|Orion-MOE8x7B|<span style="background-color: #add8e6;">**87.4**</span> |<span style="background-color: #add8e6;">**70.1**</span> |<span style="background-color: #add8e6;">**85.6**</span> |<span style="background-color: #add8e6;">**68.8**</span> |<span style="background-color: #add8e6;">**80.6**</span> |<span style="background-color: #add8e6;">**63.5**</span> |<span style="background-color: #add8e6;">**69.4**</span> |<span style="background-color: #add8e6;">**54.3**</span> |

### 3.1.5. Leakage Detection Results

When the pre-training data of a large language model contains content from a specific dataset, the model's performance on that dataset may be artificially inflated, leading to inaccurate performance evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions have proposed a simple and effective method for detecting data leakage. The method leverages the interchangeable nature of multiple-choice options: the options in the original dataset are shuffled to generate a derived dataset, and the model is then used to compute the log-probability distribution over the derived data to detect whether the original dataset has leaked.

More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.

|Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion MOE8x7B|Orion 14B|Mixtral 8x7B|
|------|------|------|------|------|------|
|MMLU | 0.30 | 0.27 | <span style="background-color: #add8e6;">**0.22**</span> | 0.28 | 0.25 |
|CEval | 0.39 | 0.38 | <span style="background-color: #add8e6;">0.27</span> | **0.26** | **0.26** |
|CMMLU | 0.38 | 0.39 | <span style="background-color: #add8e6;">0.23</span> | 0.27 | **0.22** |

### 3.1.6. Inference speed

We set up inference servers on 8x Nvidia RTX 3090 GPUs and 4x Nvidia A100 GPUs and measured throughput from the client side in tokens per second.

|Models | 8x3090 1 concurrent | 8x3090 4 concurrent | 4xA100 1 concurrent | 4xA100 4 concurrent|
|---------|--------|-------|--------|-------|
|OrionMOE | <span style="background-color: #add8e6;">**102.77**</span> | <span style="background-color: #add8e6;">**54.61**</span> | <span style="background-color: #add8e6;">**107.76**</span> | <span style="background-color: #add8e6;">**61.83**</span> |
|Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |

<br>
We also compared inference speed on 4x A100 for different input lengths (tokens), again measuring tokens per second from the client.

| Input | 4k | 8k | 12k | 16k | 32k | 64k |
|---------|-------|-------|-------|-------|-------|-------|
|OrionMOE | <span style="background-color: #add8e6;">**90.86**</span> | <span style="background-color: #add8e6;">**54.40**</span> | <span style="background-color: #add8e6;">**31.08**</span> | <span style="background-color: #add8e6;">**29.04**</span> | <span style="background-color: #add8e6;">**22.69**</span> | <span style="background-color: #add8e6;">**14.51**</span> |
|Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |

<a name="zh_model-inference"></a><br>
# 4. Model Inference

<div align="center">
<img src="./assets/imgs/wechat_group.jpg" alt="wechat" width="40%" />
</div>