renillhuang committed
Commit: dc9ef78
Parent: 3e59fda

readme: Update tables with background color and bold best score

Files changed (2):
  1. README.md +48 -47
  2. README_zh.md +46 -48
README.md CHANGED
@@ -96,54 +96,54 @@ Model release and download links are provided in the table below:
 
  ## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks
  ### 3.1.1. LLM evaluation results on examination and professional knowledge
- |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8x7B|
- | ----------- | ----- | ----- | ----- | ----- | ----- |
- |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
- |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
- |MMLU | 70.40 | 73.40 | 82.90 | 69.94 | 85.90 |
- |MMLU Pro | 38.50 | 45.25 | 58.01 | 33.95 | 58.31 |
- |ARC_c | 85.08 | 90.17 | 94.24 | 79.66 | 91.86 |
- |HellaSwag | 81.95 | 81.98 | 82.51 | 78.53 | 89.19 |
- |LAMBADA | 76.79 | 73.74 | 75.37 | 78.83 | 79.74 |
- |BBH | 50.87 | 57.28 | 67.69 | 50.35 | 55.82 |
- |MuSR | 43.21 | 42.65 | 49.78 | 43.61 | 49.93 |
- |PIQA | 83.41 | 82.15 | 80.05 | 79.54 | 87.32 |
- |CommonSenseQA| 69.62 | 74.69 | 72.97 | 66.91 | 73.05 |
- |IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06 |
- |GPQA | 30.90 | 33.49 | 49.50 | 28.53 | 52.17 |
- |HumanEval | 33.54 | 35.98 | 46.95 | 20.12 | 44.51 |
- |MBPP | 60.70 | 49.40 | 71.00 | 30.00 | 43.40 |
- |MATH Lv5 | 9.00 | 25.00 | 31.72 | 2.54 | 5.07 |
- |GSM8K | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
- |MATH | 28.40 | 36.10 | 48.88 | 7.84 | 23.68 |
+ |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion MOE8x7B|
+ | -------------- | ---- | ---- | ---- | ---- | ---- |
+ | MMLU | 70.4 | 73.4 | 82.9 | 69.9 | <span style="background-color: #add8e6;">**85.9**</span> |
+ | MMLU Pro | 38.5 | 45.3 | 58.0 | 34.0 | <span style="background-color: #add8e6;">**58.3**</span> |
+ | CEval | 54.1 | 83.5 | 87.7 | 72.8 | <span style="background-color: #add8e6;">**89.7**</span> |
+ | CMMLU | 53.2 | 82.3 | 89.0 | 70.6 | <span style="background-color: #add8e6;">**89.2**</span> |
+ | ARC_c | 85.1 | 90.2 | **94.2** | 79.7 | <span style="background-color: #add8e6;">91.9</span> |
+ | HellaSwag | 81.9 | 82.0 | 82.5 | 78.5 | <span style="background-color: #add8e6;">**89.2**</span> |
+ | LAMBADA | 76.8 | 73.7 | 75.4 | 78.8 | <span style="background-color: #add8e6;">**79.7**</span> |
+ | BBH | 50.9 | 57.3 | **67.7** | 50.4 | <span style="background-color: #add8e6;">55.8</span> |
+ | MuSR | 43.2 | 42.7 | 49.8 | 43.6 | <span style="background-color: #add8e6;">**49.9**</span> |
+ | PIQA | 83.4 | 82.2 | 80.1 | 79.5 | <span style="background-color: #add8e6;">**87.3**</span> |
+ | CommonSenseQA | 69.6 | **74.7** | 73.0 | 66.9 | <span style="background-color: #add8e6;">73.1</span> |
+ | IFEval | 24.2 | 33.0 | **41.6** | 29.1 | <span style="background-color: #add8e6;">30.1</span> |
+ | GPQA | 30.9 | 33.5 | 49.5 | 28.5 | <span style="background-color: #add8e6;">**52.2**</span> |
+ | HumanEval | 33.5 | 36.0 | **47.0** | 20.1 | <span style="background-color: #add8e6;">44.5</span> |
+
+
+
 
  ### 3.1.2. Comparison of LLM performances on Japanese testsets
- | Model | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
- |--------------|-------|-------|-------|-------|-------|-------|-------|
- |Mixtral-8x7B | 89.00 | 78.73 | 32.13 | 95.44 | 78.86 | 44.50 | 69.78 |
- |Qwen1.5-32B | 89.86 | 84.54 | 50.99 | 97.08 | 82.14 | 43.80 | 74.74 |
- |Qwen2.5-32B | 89.09 | 93.83 | 72.14 | 97.86 | 89.27 | 42.15 | 80.73 |
- |Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
- |Orion 8x7B | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |
+ |Model |Average|JSQuAD|JCommonSenseQA|JNLI|MARC-ja|JAQKET v2|PAWS-ja|
+ |-------------|-------|-------|---------------|-----|-------|---------|-------|
+ |Mixtral-8x7B |<span style="background-color: #ffffe0;">69.8</span> |89.0 |78.7 |32.1 |95.4 |78.9 |44.5 |
+ |Qwen1.5-32B |<span style="background-color: #ffffe0;">74.7</span> |89.9 |84.5 |51.0 |97.1 |82.1 |43.8 |
+ |Qwen2.5-32B |<span style="background-color: #ffffe0;">80.7</span> |89.1 |93.8 |72.1 |**97.9** |**89.3** |42.2 |
+ |Orion-14B |<span style="background-color: #ffffe0;">74.2</span> |74.2 |88.2 |72.8 |94.1 |66.2 |49.9 |
+ |Orion-MOE8x7B|<span style="background-color: #ffffe0;">**82.9**</span> |<span style="background-color: #add8e6;">**91.8**</span> |<span style="background-color: #add8e6;">90.4</span> |<span style="background-color: #add8e6;">**90.5**</span> |<span style="background-color: #add8e6;">96.4</span> |<span style="background-color: #add8e6;">81.2</span> |<span style="background-color: #add8e6;">**47.4**</span> |
 
  ### 3.1.3. Comparison of LLM performances on Korean testsets
- |Model | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
- |--------------|-------|-------|-------|-------|-------|-------|-------|-------|
- |Mixtral-8x7B | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
- |Qwen1.5-32B | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
- |Qwen2.5-32B | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
- |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
- |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
+ |Model|Average|HAE-RAE|KoBEST BoolQ|KoBEST COPA|KoBEST HellaSwag|KoBEST SentiNeg|KoBEST WiC|PAWS-ko|
+ |-----|-------|-------|------------|-----------|----------------|---------------|----------|-------|
+ |Mixtral-8x7B |<span style="background-color: #ffffe0;">60.7</span> |53.2 |78.6 |66.2 |56.6 |77.1 |49.4 |44.1 |
+ |Qwen1.5-32B |<span style="background-color: #ffffe0;">58.6</span> |46.4 |76.3 |60.4 |53.0 |78.3 |52.1 |43.4 |
+ |Qwen2.5-32B |<span style="background-color: #ffffe0;">71.4</span> |**70.7** |80.3 |76.7 |**61.2** |96.5 |**77.2** |37.1 |
+ |Orion-14B |<span style="background-color: #ffffe0;">67.7</span> |69.7 |80.6 |77.1 |58.2 |92.4 |51.2 |44.6 |
+ |Orion-MOE8x7B|<span style="background-color: #ffffe0;">**72.0**</span> |<span style="background-color: #add8e6;">65.2</span> |<span style="background-color: #add8e6;">**85.4**</span> |<span style="background-color: #add8e6;">**80.4**</span> |<span style="background-color: #add8e6;">56.0</span> |<span style="background-color: #add8e6;">**97.0**</span> |<span style="background-color: #add8e6;">73.6</span> |<span style="background-color: #add8e6;">**46.4**</span> |
+
 
  ### 3.1.4. Comparison of LLM performances on Arabic, German, French, and Spanish testsets
- | Lang | ar | | de | | fr | | es | |
+ | Language | Spanish | | French | | German | | Arabic | |
  |----|----|----|----|----|----|----|----|----|
  |**Model**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
- |Mixtral-8x7B | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
- |Qwen1.5-32B | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
- |Qwen2.5-32B | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
- |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
- |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
+ |Mixtral-8x7B |74.3 |54.8 |73.9 |55.9 |69.2 |52.4 |47.9 |36.3 |
+ |Qwen1.5-32B |70.5 |55.1 |68.9 |56.0 |63.8 |50.8 |50.1 |40.0 |
+ |Qwen2.5-32B |75.0 |65.3 |74.2 |62.7 |69.8 |61.8 |59.8 |52.9 |
+ |Orion-14B |62.0 |44.6 |60.2 |42.3 |54.7 |38.9 |42.3 |33.9 |
+ |Orion-MOE8x7B|<span style="background-color: #add8e6;">**87.4**</span> |<span style="background-color: #add8e6;">**70.1**</span> |<span style="background-color: #add8e6;">**85.6**</span> |<span style="background-color: #add8e6;">**68.8**</span> |<span style="background-color: #add8e6;">**80.6**</span> |<span style="background-color: #add8e6;">**63.5**</span> |<span style="background-color: #add8e6;">**69.4**</span> |<span style="background-color: #add8e6;">**54.3**</span> |
 
  ### 3.1.5. Leakage Detection Benchmark
  When the pre-training data of a large language model contains content from a specific dataset, the model’s performance on that dataset may be artificially enhanced, leading to inaccurate performance evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions have proposed a simple and effective method for detecting data leakage. This method leverages the interchangeable nature of multiple-choice options by shuffling the options in the original dataset to generate derived data. The log-probability distribution of the derived dataset is then computed using the model to detect whether the original dataset has been leaked.
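To make the shuffling step concrete, here is a minimal sketch of scoring one four-option item under every option ordering. It is an illustration only, not the released implementation linked below: the `gpt2` stand-in model, the plain question-plus-options prompt layout, and the rank heuristic are all assumptions.

```python
# Illustrative sketch of shuffled-option leakage scoring; the released code
# is at github.com/nishiwen1214/Benchmark-leakage-detection.
from itertools import permutations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; substitute the model under test

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def sequence_logprob(text: str) -> float:
    """Sum of the log-probabilities the model assigns to each token of `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    return logps.gather(2, ids[:, 1:].unsqueeze(-1)).sum().item()

question = "What is the capital of France?"        # toy item, assumed format
options = ["Paris", "London", "Berlin", "Madrid"]  # original A-D ordering

# Score every ordering of the options, the original one included (4! = 24).
scores = {
    perm: sequence_logprob(
        question + "\n" + "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(perm))
    )
    for perm in permutations(options)
}

rank = sorted(scores, key=scores.get, reverse=True).index(tuple(options))
print(f"original ordering ranks {rank + 1} of {len(scores)}")
# If the original ordering lands in first place far more often than the 1/24
# chance rate across a whole benchmark, that benchmark has likely leaked into
# the pre-training data.
```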
@@ -152,28 +152,29 @@ We conducted data leakage detection experiments on three benchmark datasets: MMLU, CEval, and CMMLU.
  More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
  Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
 
- |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
+ |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion MOE8x7B|Orion 14B|Mixtral 8x7B|
  |------|------|------|------|------|------|
- |MMLU | 0.30 | 0.27 | 0.22 | 0.28 | 0.25 |
- |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
- |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
+ |MMLU | 0.30 | 0.27 | <span style="background-color: #add8e6;">**0.22**</span> | 0.28 | 0.25 |
+ |CEval | 0.39 | 0.38 | <span style="background-color: #add8e6;">0.27</span> | **0.26** | **0.26** |
+ |CMMLU | 0.38 | 0.39 | <span style="background-color: #add8e6;">0.23</span> | 0.27 | **0.22** |
 
  ### 3.1.6. Inference speed
  We set up inference servers on 8x Nvidia RTX 3090 and 4x Nvidia A100 and measured client-side throughput in tokens per second.
  |Models | 8x3090, 1 concurrent | 8x3090, 4 concurrent | 4xA100, 1 concurrent | 4xA100, 4 concurrent|
  |---------|--------|-------|--------|-------|
- |OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+ |OrionMOE | <span style="background-color: #add8e6;">**102.77**</span> | <span style="background-color: #add8e6;">**54.61**</span> | <span style="background-color: #add8e6;">**107.76**</span> | <span style="background-color: #add8e6;">**61.83**</span> |
  |Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
 
  <br>
  We also compared inference speed on 4x A100 across different input lengths (tokens), again measuring client-side throughput in tokens per second.
 
- |input size | 4k | 8k | 12k | 16k | 32k | 64k |
+ | Input | 4k | 8k | 12k | 16k | 32k | 64k |
  |---------|-------|-------|-------|-------|-------|-------|
- |OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+ |OrionMOE | <span style="background-color: #add8e6;">**90.86**</span> | <span style="background-color: #add8e6;">**54.40**</span> | <span style="background-color: #add8e6;">**31.08**</span> | <span style="background-color: #add8e6;">**29.04**</span> | <span style="background-color: #add8e6;">**22.69**</span> | <span style="background-color: #add8e6;">**14.51**</span> |
  |Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
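As a rough guide to how such client-side numbers can be collected, the sketch below times one completion request and divides the generated-token count by wall-clock time. It assumes an OpenAI-compatible `/v1/completions` endpoint; the URL and model id are placeholders, and the README does not say which serving stack produced the reported figures.

```python
# Minimal client-side throughput probe (endpoint URL and model id are
# placeholders; the actual serving stack used for the tables is unspecified).
import time

import requests

URL = "http://localhost:8000/v1/completions"  # assumed OpenAI-compatible server

payload = {
    "model": "Orion-MOE8x7B-Base",            # placeholder model id
    "prompt": "Write a short story about the sea.",
    "max_tokens": 256,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.perf_counter() - start

# OpenAI-compatible servers report the generated-token count under `usage`.
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens / elapsed:.2f} tokens/s")
```

The 4-concurrent columns would correspond to running several such clients in parallel and aggregating the per-client rates.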
 
 
+
  <a name="model-inference"></a><br>
  # 4. Model Inference
 
 
README_zh.md CHANGED
@@ -89,54 +89,51 @@
  ## 3.1. Base Model Orion-MOE8x7B-Base Evaluation
 
  ### 3.1.1. Base model benchmark comparison
- |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8x7B|
- | ----------- | ----- | ----- | ----- | ----- | ----- |
- |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
- |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
- |MMLU | 70.40 | 73.40 | 82.90 | 69.94 | 85.90 |
- |MMLU Pro | 38.50 | 45.25 | 58.01 | 33.95 | 58.31 |
- |ARC_c | 85.08 | 90.17 | 94.24 | 79.66 | 91.86 |
- |HellaSwag | 81.95 | 81.98 | 82.51 | 78.53 | 89.19 |
- |LAMBADA | 76.79 | 73.74 | 75.37 | 78.83 | 79.74 |
- |BBH | 50.87 | 57.28 | 67.69 | 50.35 | 55.82 |
- |MuSR | 43.21 | 42.65 | 49.78 | 43.61 | 49.93 |
- |PIQA | 83.41 | 82.15 | 80.05 | 79.54 | 87.32 |
- |CommonSenseQA| 69.62 | 74.69 | 72.97 | 66.91 | 73.05 |
- |IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06 |
- |GPQA | 30.90 | 33.49 | 49.50 | 28.53 | 52.17 |
- |HumanEval | 33.54 | 35.98 | 46.95 | 20.12 | 44.51 |
- |MBPP | 60.70 | 49.40 | 71.00 | 30.00 | 43.40 |
- |MATH Lv5 | 9.00 | 25.00 | 31.72 | 2.54 | 5.07 |
- |GSM8K | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
- |MATH | 28.40 | 36.10 | 48.88 | 7.84 | 23.68 |
+ |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion MOE8x7B|
+ | -------------- | ---- | ---- | ---- | ---- | ---- |
+ | MMLU | 70.4 | 73.4 | 82.9 | 69.9 | <span style="background-color: #add8e6;">**85.9**</span> |
+ | MMLU Pro | 38.5 | 45.3 | 58.0 | 34.0 | <span style="background-color: #add8e6;">**58.3**</span> |
+ | CEval | 54.1 | 83.5 | 87.7 | 72.8 | <span style="background-color: #add8e6;">**89.7**</span> |
+ | CMMLU | 53.2 | 82.3 | 89.0 | 70.6 | <span style="background-color: #add8e6;">**89.2**</span> |
+ | ARC_c | 85.1 | 90.2 | **94.2** | 79.7 | <span style="background-color: #add8e6;">91.9</span> |
+ | HellaSwag | 81.9 | 82.0 | 82.5 | 78.5 | <span style="background-color: #add8e6;">**89.2**</span> |
+ | LAMBADA | 76.8 | 73.7 | 75.4 | 78.8 | <span style="background-color: #add8e6;">**79.7**</span> |
+ | BBH | 50.9 | 57.3 | **67.7** | 50.4 | <span style="background-color: #add8e6;">55.8</span> |
+ | MuSR | 43.2 | 42.7 | 49.8 | 43.6 | <span style="background-color: #add8e6;">**49.9**</span> |
+ | PIQA | 83.4 | 82.2 | 80.1 | 79.5 | <span style="background-color: #add8e6;">**87.3**</span> |
+ | CommonSenseQA | 69.6 | **74.7** | 73.0 | 66.9 | <span style="background-color: #add8e6;">73.1</span> |
+ | IFEval | 24.2 | 33.0 | **41.6** | 29.1 | <span style="background-color: #add8e6;">30.1</span> |
+ | GPQA | 30.9 | 33.5 | 49.5 | 28.5 | <span style="background-color: #add8e6;">**52.2**</span> |
+ | HumanEval | 33.5 | 36.0 | **47.0** | 20.1 | <span style="background-color: #add8e6;">44.5</span> |
+
 
  ### 3.1.2. Minor languages: Japanese
- | Model | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
- |--------------|-------|-------|-------|-------|-------|-------|-------|
- |Mixtral-8x7B | 89.00 | 78.73 | 32.13 | 95.44 | 78.86 | 44.50 | 69.78 |
- |Qwen1.5-32B | 89.86 | 84.54 | 50.99 | 97.08 | 82.14 | 43.80 | 74.74 |
- |Qwen2.5-32B | 89.09 | 93.83 | 72.14 | 97.86 | 89.27 | 42.15 | 80.73 |
- |Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
- |Orion 8x7B | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |
+ |Model |Average|JSQuAD|JCommonSenseQA|JNLI|MARC-ja|JAQKET v2|PAWS-ja|
+ |-------------|-------|-------|---------------|-----|-------|---------|-------|
+ |Mixtral-8x7B |<span style="background-color: #ffffe0;">69.8</span> |89.0 |78.7 |32.1 |95.4 |78.9 |44.5 |
+ |Qwen1.5-32B |<span style="background-color: #ffffe0;">74.7</span> |89.9 |84.5 |51.0 |97.1 |82.1 |43.8 |
+ |Qwen2.5-32B |<span style="background-color: #ffffe0;">80.7</span> |89.1 |93.8 |72.1 |**97.9** |**89.3** |42.2 |
+ |Orion-14B |<span style="background-color: #ffffe0;">74.2</span> |74.2 |88.2 |72.8 |94.1 |66.2 |49.9 |
+ |Orion-MOE8x7B|<span style="background-color: #ffffe0;">**82.9**</span> |<span style="background-color: #add8e6;">**91.8**</span> |<span style="background-color: #add8e6;">90.4</span> |<span style="background-color: #add8e6;">**90.5**</span> |<span style="background-color: #add8e6;">96.4</span> |<span style="background-color: #add8e6;">81.2</span> |<span style="background-color: #add8e6;">**47.4**</span> |
 
  ### 3.1.3. Minor languages: Korean
- |Model | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
- |--------------|-------|-------|-------|-------|-------|-------|-------|-------|
- |Mixtral-8x7B | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
- |Qwen1.5-32B | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
- |Qwen2.5-32B | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
- |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
- |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
+ |Model|Average|HAE-RAE|KoBEST BoolQ|KoBEST COPA|KoBEST HellaSwag|KoBEST SentiNeg|KoBEST WiC|PAWS-ko|
+ |-----|-------|-------|------------|-----------|----------------|---------------|----------|-------|
+ |Mixtral-8x7B |<span style="background-color: #ffffe0;">60.7</span> |53.2 |78.6 |66.2 |56.6 |77.1 |49.4 |44.1 |
+ |Qwen1.5-32B |<span style="background-color: #ffffe0;">58.6</span> |46.4 |76.3 |60.4 |53.0 |78.3 |52.1 |43.4 |
+ |Qwen2.5-32B |<span style="background-color: #ffffe0;">71.4</span> |**70.7** |80.3 |76.7 |**61.2** |96.5 |**77.2** |37.1 |
+ |Orion-14B |<span style="background-color: #ffffe0;">67.7</span> |69.7 |80.6 |77.1 |58.2 |92.4 |51.2 |44.6 |
+ |Orion-MOE8x7B|<span style="background-color: #ffffe0;">**72.0**</span> |<span style="background-color: #add8e6;">65.2</span> |<span style="background-color: #add8e6;">**85.4**</span> |<span style="background-color: #add8e6;">**80.4**</span> |<span style="background-color: #add8e6;">56.0</span> |<span style="background-color: #add8e6;">**97.0**</span> |<span style="background-color: #add8e6;">73.6</span> |<span style="background-color: #add8e6;">**46.4**</span> |
 
  ### 3.1.4. Minor languages: Arabic, German, French, Spanish
- | Lang | ar | | de | | fr | | es | |
+ | Language | Spanish | | French | | German | | Arabic | |
  |----|----|----|----|----|----|----|----|----|
  |**Model**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
- |Mixtral-8x7B | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
- |Qwen1.5-32B | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
- |Qwen2.5-32B | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
- |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
- |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
+ |Mixtral-8x7B |74.3 |54.8 |73.9 |55.9 |69.2 |52.4 |47.9 |36.3 |
+ |Qwen1.5-32B |70.5 |55.1 |68.9 |56.0 |63.8 |50.8 |50.1 |40.0 |
+ |Qwen2.5-32B |75.0 |65.3 |74.2 |62.7 |69.8 |61.8 |59.8 |52.9 |
+ |Orion-14B |62.0 |44.6 |60.2 |42.3 |54.7 |38.9 |42.3 |33.9 |
+ |Orion-MOE8x7B|<span style="background-color: #add8e6;">**87.4**</span> |<span style="background-color: #add8e6;">**70.1**</span> |<span style="background-color: #add8e6;">**85.6**</span> |<span style="background-color: #add8e6;">**68.8**</span> |<span style="background-color: #add8e6;">**80.6**</span> |<span style="background-color: #add8e6;">**63.5**</span> |<span style="background-color: #add8e6;">**69.4**</span> |<span style="background-color: #add8e6;">**54.3**</span> |
 
  ### 3.1.5. Leakage detection results
  When the pre-training data of a large language model contains content from a specific dataset, its performance on that dataset may be artificially inflated, leading to inaccurate performance evaluations. To address this, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions proposed a simple and effective data-leakage detection method. The method exploits the interchangeable nature of multiple-choice options: the options in the original dataset are shuffled to generate derived data, and the model's log-probability distribution over the derived dataset is then computed to detect whether the original dataset has leaked.
@@ -145,28 +142,29 @@
  More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
  Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
 
- |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
+ |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion MOE8x7B|Orion 14B|Mixtral 8x7B|
  |------|------|------|------|------|------|
- |MMLU | 0.30 | 0.27 | 0.22 | 0.28 | 0.25 |
- |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
- |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
+ |MMLU | 0.30 | 0.27 | <span style="background-color: #add8e6;">**0.22**</span> | 0.28 | 0.25 |
+ |CEval | 0.39 | 0.38 | <span style="background-color: #add8e6;">0.27</span> | **0.26** | **0.26** |
+ |CMMLU | 0.38 | 0.39 | <span style="background-color: #add8e6;">0.23</span> | 0.27 | **0.22** |
 
  ### 3.1.6. Inference speed
  We set up inference servers on 8x Nvidia RTX 3090 and 4x Nvidia A100 and measured client-side throughput in tokens per second.
  |Models | 8x3090, 1 concurrent | 8x3090, 4 concurrent | 4xA100, 1 concurrent | 4xA100, 4 concurrent|
  |---------|--------|-------|--------|-------|
- |OrionMOE | 102.77 | 54.61 | 107.76 | 61.83 |
+ |OrionMOE | <span style="background-color: #add8e6;">**102.77**</span> | <span style="background-color: #add8e6;">**54.61**</span> | <span style="background-color: #add8e6;">**107.76**</span> | <span style="background-color: #add8e6;">**61.83**</span> |
  |Qwen32 | 52.93 | 46.06 | 62.43 | 56.81 |
 
  <br>
  We also compared inference speed on 4x A100 across different input lengths (tokens), again measuring client-side throughput in tokens per second.
 
- |input size | 4k | 8k | 12k | 16k | 32k | 64k |
+ | Input | 4k | 8k | 12k | 16k | 32k | 64k |
  |---------|-------|-------|-------|-------|-------|-------|
- |OrionMOE | 90.86 | 54.40 | 31.08 | 29.04 | 22.69 | 14.51 |
+ |OrionMOE | <span style="background-color: #add8e6;">**90.86**</span> | <span style="background-color: #add8e6;">**54.40**</span> | <span style="background-color: #add8e6;">**31.08**</span> | <span style="background-color: #add8e6;">**29.04**</span> | <span style="background-color: #add8e6;">**22.69**</span> | <span style="background-color: #add8e6;">**14.51**</span> |
  |Qwen32 | 53.99 | 47.59 | 25.98 | 24.35 | 18.64 | 11.86 |
 
 
+
  <a name="zh_model-inference"></a><br>
  # 4. Model Inference
 
@@ -262,4 +260,4 @@ full-chain capabilities and accumulated experience in Engineering and Agent development; possessing a complete end
 
  <div align="center">
  <img src="./assets/imgs/wechat_group.jpg" alt="wechat" width="40%" />
- </div>
+ </div>
 