renillhuang committed on
Commit 4871737
1 Parent(s): 301ba04

readme: Modify benchmark tables


Signed-off-by: eric <[email protected]>

Files changed (2)
  1. README.md +27 -27
  2. README_zh.md +29 -27
README.md CHANGED
@@ -68,25 +68,26 @@ Model release and download links are provided in the table below:
 
  ## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks
  ### 3.1.1. LLM evaluation results on examination and professional knowledge
- | Model | ceval | cmmlu | mmlu | mmlu_pro | ARC_c | hellaswag |
- |-------|---------|--------|-------|----------|-------|-----------|
- |Mixtral 8x7B | 54.0861 | 53.21 | 70.4000 | 38.5000 | 85.0847 | 81.9458 |
- |Qwen1.5-32b | 83.5000 | 82.3000 | 73.4000 | 45.2500 | 90.1695 | 81.9757 |
- |Qwen2.5-32b | 87.7414 | 89.0088 | 82.9000 | 58.0100 | 94.2373 | 82.5134 |
- |Orion 14B | 72.8000 | 70.5700 | 69.9400 | 33.9500 | 79.6600 | 78.5300 |
- |<span style="color: red;">Orion 8x7B | <span style="color: red;">89.7400 | <span style="color: red;">89.1555 | <span style="color: red;">85.9000 | <span style="color: red;">58.3100 | <span style="color: red;">91.8644 | <span style="color: red;">89.19 |
- |**Model**|**lambada**|**bbh**|**musr**|**piqa**|**commonsense_qa**|**IFEval**|
- |Mixtral 8x7B | 76.7902 | 50.87 | 43.21 | 83.41 | 69.62 | 24.15 |
- |Qwen1.5-32b | 73.7434 | 57.2800 | 42.6500 | 82.1500 | 74.6900 | 32.9700 |
- |Qwen2.5-32b | 75.3736 | 67.6900 | 49.7800 | 80.0500 | 72.9700 | 41.5900 |
- |Orion 14B | 78.8300 | 50.3500 | 43.6100 | 79.5400 | 66.9100 | 29.0800 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">79.7399|<span style="color: red;">55.82 |<span style="color: red;">49.93 |<span style="color: red;">87.32 |<span style="color: red;">73.05 |<span style="color: red;">30.06 |
- |**Model**|**GQPA**|**human-eval**|**MBPP**|**math_lv5**|**gsm8k**|**math**|
- |Mixtral 8x7B | 30.9000 | 33.5366 | 60.7000 | 9.0000 | 47.5000 | 28.4000 |
- |Qwen1.5-32b | 33.4900 | 35.9756 | 49.4000 | 25.0000 | 77.4000 | 36.1000 |
- |Qwen2.5-32b | 49.5000 | 46.9512 | 71.0000 | 31.7200 | 80.3630 | 48.8800 |
- |Orion 14B | 28.5300 | 20.1200 | 30.0000 | 2.5400 | 52.0100 | 7.8400 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">52.1700 |<span style="color: red;">44.5122 |<span style="color: red;">43.4 |<span style="color: red;">5.07 |<span style="color: red;">59.8200 |<span style="color: red;">23.6800 |
+ |TestSet | Mixtral 8*7B | Qwen1.5-32b | Qwen2.5-32b | Orion 14B | Orion 8*7B|
+ | -- | -- | -- | -- | -- | -- |
+ |ceval | 54.0861 | 83.5 | 87.7414 | 72.8 | 89.74|
+ |cmmlu | 53.21 | 82.3 | 89.0088 | 70.57 | 89.1555|
+ |mmlu | 70.4 | 73.4 | 82.9 | 69.94 | 85.9|
+ |mmlu_pro | 38.5 | 45.25 | 58.01 | 33.95 | 58.31|
+ |ARC_c | 85.0847 | 90.1695 | 94.2373 | 79.66 | 91.8644|
+ |hellaswag | 81.9458 | 81.9757 | 82.5134 | 78.53 | 89.19|
+ |lambada | 76.7902 | 73.7434 | 75.3736 | 78.83 | 79.7399|
+ |bbh | 50.87 | 57.28 | 67.69 | 50.35 | 55.82|
+ |musr | 43.21 | 42.65 | 49.78 | 43.61 | 49.93|
+ |piqa | 83.41 | 82.15 | 80.05 | 79.54 | 87.32|
+ |commonsense_qa | 69.62 | 74.69 | 72.97 | 66.91 | 73.05|
+ |IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06|
+ |GQPA | 30.9 | 33.49 | 49.5 | 28.53 | 52.17|
+ |human-eval | 33.5366 | 35.9756 | 46.9512 | 20.12 | 44.5122|
+ |MBPP | 60.7 | 49.4 | 71 | 30 | 43.4|
+ |math lv5 | 9 | 25 | 31.72 | 2.54 | 5.07|
+ |gsm8k | 47.5 | 77.4 | 80.363 | 52.01 | 59.82|
+ |math | 28.4 | 36.1 | 48.88 | 7.84 | 23.68|
 
  ### 3.1.2. Comparison of LLM performances on Japanese testsets
  | Model | jsquad | jcommonsenseqa | jnli | marc_ja | jaqket_v2 | paws_ja | avg |
@@ -95,7 +96,7 @@ Model release and download links are provided in the table below:
  |Qwen1.5-32B | 0.8986 | 0.8454 | 0.5099 | 0.9708 | 0.8214 | 0.4380 | 0.7474 |
  |Qwen2.5-32B | 0.8909 | 0.9383 | 0.7214 | 0.9786 | 0.8927 | 0.4215 | 0.8073 |
  |Orion-14B-Base | 0.7422 | 0.8820 | 0.7285 | 0.9406 | 0.6620 | 0.4990 | 0.7424 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">0.9177 |<span style="color: red;">0.9043 |<span style="color: red;">0.9046 |<span style="color: red;">0.9640 |<span style="color: red;">0.8119 |<span style="color: red;">0.4735 |<span style="color: red;">0.8293 |
+ |Orion 8x7B |0.9177 |0.9043 |0.9046 |0.9640 |0.8119 |0.4735 |0.8293 |
 
  ### 3.1.3. Comparison of LLM performances on Korean testsets
  |Model | haerae | kobest boolq | kobest copa | kobest hellaswag | kobest sentineg | kobest wic | paws_ko | avg |
@@ -104,7 +105,7 @@ Model release and download links are provided in the table below:
  |Qwen1.5-32B | 46.38 | 76.28 | 60.4 | 53 | 78.34 | 52.14 | 43.4 | 58.56285714 |
  |Qwen2.5-32B | 70.67 | 80.27 | 76.7 | 61.2 | 96.47 | 77.22 | 37.05 | 71.36857143 |
  |Orion-14B-Base | 69.66 | 80.63 | 77.1 | 58.2 | 92.44 | 51.19 | 44.55 | 67.68142857 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">65.17 |<span style="color: red;">85.4 |<span style="color: red;">80.4 |<span style="color: red;">56 |<span style="color: red;">96.98 |<span style="color: red;">73.57 |<span style="color: red;">46.35 |<span style="color: red;">71.98142857 |
+ |Orion 8x7B |65.17 |85.4 |80.4 |56 |96.98 |73.57 |46.35 |71.98142857 |
 
  ### 3.1.4. Comparison of LLM performances on Arabic, German, French, and Spanish testsets
  | Lang | ar | | de | | fr | | es | |
@@ -114,21 +115,20 @@ Model release and download links are provided in the table below:
  |Qwen1.5-32B | 50.07 | 39.95 | 63.77 | 50.81 | 68.86 | 55.95 | 70.5 | 55.13 |
  |Qwen2.5-32B | 59.76 | 52.87 | 69.82 | 61.76 | 74.15 | 62.7 | 75.04 | 65.3 |
  |Orion-14B-Base | 42.26 | 33.88 | 54.65 | 38.92 | 60.21 | 42.34 | 62 | 44.62 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">69.39 |<span style="color: red;">54.32 |<span style="color: red;">80.6 |<span style="color: red;">63.47 |<span style="color: red;">85.56 |<span style="color: red;">68.78 |<span style="color: red;">87.41 |<span style="color: red;">70.09 |
+ |Orion 8x7B |69.39 |54.32 |80.6 |63.47 |85.56 |68.78 |87.41 |70.09 |
 
  ### 3.1.5. Leakage Detection Benchmark
  The proportion of leakage data(from various evaluation benchmarks) in the pre-trained corpus; the higher the proportion, the more leakage it indicates.
  - Code: https://github.com/nishiwen1214/Benchmark-leakage-detection
  - Paper: https://web3.arxiv.org/pdf/2409.01790
- - Blog: https://mp.weixin.qq.com/s/BtcJmDEUyzAYG-fqCal2lA
  - English Test: mmlu
  - Chinese Test: ceval, cmmlu
 
  |Threshold 0.2 | qwen2.5 32b | qwen1.5 32b | orion 8x7b | orion 14b | mixtral 8x7b |
- |----|----|----|----|----|----|
- |mmlu | 0.3 | 0.27 |<span style="color: red;">0.22 | 0.28 | 0.25 |
- |ceval | 0.39 | 0.38 |<span style="color: red;">0.27 | 0.26 | 0.26 |
- |cmmlu | 0.38 | 0.39 |<span style="color: red;">0.23 | 0.27 | 0.22 |
+ |------|------|------|------|------|------|
+ |mmlu | 0.3 | 0.27 | 0.22 | 0.28 | 0.25 |
+ |ceval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
+ |cmmlu | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
 
  ### 3.1.6. Inference speed
  Based on 8x Nvidia RTX3090, in unit of tokens per second.
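The last hunk keeps section 3.1.6, which reports inference speed on 8x Nvidia RTX3090 in tokens per second. As a companion to that table, here is a minimal sketch of how such a throughput figure can be measured with Hugging Face transformers; the model id, prompt, and generation length below are illustrative assumptions and are not part of this commit.

```python
# Illustrative throughput check for the tokens-per-second numbers in section 3.1.6.
# The model id is an assumed repo name; adjust it to the actual released checkpoint.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionStarAI/Orion-MOE8x7B-Base"  # assumption, not confirmed by this commit
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread the MoE weights across the available GPUs
    trust_remote_code=True,
)

prompt = "Hello, my name is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")  # comparable in spirit to the 3.1.6 table
```

Actual throughput depends heavily on batch size, context length, and the serving stack, so a script like this gives only an indicative number rather than a reproduction of the table.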
README_zh.md CHANGED
@@ -65,25 +65,27 @@
  ## 3.1. 基座模型Orion-MOE8x7B-Base评估
 
  ### 3.1.1. 基座模型基准测试对比
- | Model | ceval | cmmlu | mmlu | mmlu_pro | ARC_c | hellaswag |
- |-------|---------|--------|-------|----------|-------|-----------|
- |Mixtral 8x7B | 54.0861 | 53.21 | 70.4000 | 38.5000 | 85.0847 | 81.9458 |
- |Qwen1.5-32b | 83.5000 | 82.3000 | 73.4000 | 45.2500 | 90.1695 | 81.9757 |
- |Qwen2.5-32b | 87.7414 | 89.0088 | 82.9000 | 58.0100 | 94.2373 | 82.5134 |
- |Orion 14B | 72.8000 | 70.5700 | 69.9400 | 33.9500 | 79.6600 | 78.5300 |
- |<span style="color: red;">Orion 8x7B | <span style="color: red;">89.7400 | <span style="color: red;">89.1555 | <span style="color: red;">85.9000 | <span style="color: red;">58.3100 | <span style="color: red;">91.8644 | <span style="color: red;">89.19 |
- |**Model**|**lambada**|**bbh**|**musr**|**piqa**|**commonsense_qa**|**IFEval**|
- |Mixtral 8x7B | 76.7902 | 50.87 | 43.21 | 83.41 | 69.62 | 24.15 |
- |Qwen1.5-32b | 73.7434 | 57.2800 | 42.6500 | 82.1500 | 74.6900 | 32.9700 |
- |Qwen2.5-32b | 75.3736 | 67.6900 | 49.7800 | 80.0500 | 72.9700 | 41.5900 |
- |Orion 14B | 78.8300 | 50.3500 | 43.6100 | 79.5400 | 66.9100 | 29.0800 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">79.7399|<span style="color: red;">55.82 |<span style="color: red;">49.93 |<span style="color: red;">87.32 |<span style="color: red;">73.05 |<span style="color: red;">30.06 |
- |**Model**|**GQPA**|**human-eval**|**MBPP**|**math_lv5**|**gsm8k**|**math**|
- |Mixtral 8x7B | 30.9000 | 33.5366 | 60.7000 | 9.0000 | 47.5000 | 28.4000 |
- |Qwen1.5-32b | 33.4900 | 35.9756 | 49.4000 | 25.0000 | 77.4000 | 36.1000 |
- |Qwen2.5-32b | 49.5000 | 46.9512 | 71.0000 | 31.7200 | 80.3630 | 48.8800 |
- |Orion 14B | 28.5300 | 20.1200 | 30.0000 | 2.5400 | 52.0100 | 7.8400 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">52.1700 |<span style="color: red;">44.5122 |<span style="color: red;">43.4 |<span style="color: red;">5.07 |<span style="color: red;">59.8200 |<span style="color: red;">23.6800 |
+ |TestSet | Mixtral 8*7B | Qwen1.5-32b | Qwen2.5-32b | Orion 14B | Orion 8*7B|
+ | -- | -- | -- | -- | -- | -- |
+ |ceval | 54.0861 | 83.5 | 87.7414 | 72.8 | 89.74|
+ |cmmlu | 53.21 | 82.3 | 89.0088 | 70.57 | 89.1555|
+ |mmlu | 70.4 | 73.4 | 82.9 | 69.94 | 85.9|
+ |mmlu_pro | 38.5 | 45.25 | 58.01 | 33.95 | 58.31|
+ |ARC_c | 85.0847 | 90.1695 | 94.2373 | 79.66 | 91.8644|
+ |hellaswag | 81.9458 | 81.9757 | 82.5134 | 78.53 | 89.19|
+ |lambada | 76.7902 | 73.7434 | 75.3736 | 78.83 | 79.7399|
+ |bbh | 50.87 | 57.28 | 67.69 | 50.35 | 55.82|
+ |musr | 43.21 | 42.65 | 49.78 | 43.61 | 49.93|
+ |piqa | 83.41 | 82.15 | 80.05 | 79.54 | 87.32|
+ |commonsense_qa | 69.62 | 74.69 | 72.97 | 66.91 | 73.05|
+ |IFEval | 24.15 | 32.97 | 41.59 | 29.08 | 30.06|
+ |GQPA | 30.9 | 33.49 | 49.5 | 28.53 | 52.17|
+ |human-eval | 33.5366 | 35.9756 | 46.9512 | 20.12 | 44.5122|
+ |MBPP | 60.7 | 49.4 | 71 | 30 | 43.4|
+ |math lv5 | 9 | 25 | 31.72 | 2.54 | 5.07|
+ |gsm8k | 47.5 | 77.4 | 80.363 | 52.01 | 59.82|
+ |math | 28.4 | 36.1 | 48.88 | 7.84 | 23.68|
+
 
 
  ### 3.1.2. 小语种: 日文
@@ -93,7 +95,7 @@
  |Qwen1.5-32B | 0.8986 | 0.8454 | 0.5099 | 0.9708 | 0.8214 | 0.4380 | 0.7474 |
  |Qwen2.5-32B | 0.8909 | 0.9383 | 0.7214 | 0.9786 | 0.8927 | 0.4215 | 0.8073 |
  |Orion-14B-Base | 0.7422 | 0.8820 | 0.7285 | 0.9406 | 0.6620 | 0.4990 | 0.7424 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">0.9177 |<span style="color: red;">0.9043 |<span style="color: red;">0.9046 |<span style="color: red;">0.9640 |<span style="color: red;">0.8119 |<span style="color: red;">0.4735 |<span style="color: red;">0.8293 |
+ |Orion 8x7B |0.9177 |0.9043 |0.9046 |0.9640 |0.8119 |0.4735 |0.8293 |
 
 
  ### 3.1.3. 小语种: 韩文
@@ -103,33 +105,33 @@
  |Qwen1.5-32B | 46.38 | 76.28 | 60.4 | 53 | 78.34 | 52.14 | 43.4 | 58.56285714 |
  |Qwen2.5-32B | 70.67 | 80.27 | 76.7 | 61.2 | 96.47 | 77.22 | 37.05 | 71.36857143 |
  |Orion-14B-Base | 69.66 | 80.63 | 77.1 | 58.2 | 92.44 | 51.19 | 44.55 | 67.68142857 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">65.17 |<span style="color: red;">85.4 |<span style="color: red;">80.4 |<span style="color: red;">56 |<span style="color: red;">96.98 |<span style="color: red;">73.57 |<span style="color: red;">46.35 |<span style="color: red;">71.98142857 |
+ |Orion 8x7B |65.17 |85.4 |80.4 |56 |96.98 |73.57 |46.35 |71.98142857 |
+
 
 
  ### 3.1.4. 小语种: 阿拉伯语,德语,法语,西班牙语
  | Lang | ar | | de | | fr | | es | |
- |----|----|----|----|----|----|----|----|----|
+ |------|----|--|----|--|----|--|----|--|
  |**model**|**hellaswag**|**arc**|**hellaswag**|**arc**|**hellaswag**|**arc**|**hellaswag**|**arc**|
  |Mixtral-8x7B | 47.93 | 36.27 | 69.17 | 52.35 | 73.9 | 55.86 | 74.25 | 54.79 |
  |Qwen1.5-32B | 50.07 | 39.95 | 63.77 | 50.81 | 68.86 | 55.95 | 70.5 | 55.13 |
  |Qwen2.5-32B | 59.76 | 52.87 | 69.82 | 61.76 | 74.15 | 62.7 | 75.04 | 65.3 |
  |Orion-14B-Base | 42.26 | 33.88 | 54.65 | 38.92 | 60.21 | 42.34 | 62 | 44.62 |
- |<span style="color: red;">Orion 8x7B |<span style="color: red;">69.39 |<span style="color: red;">54.32 |<span style="color: red;">80.6 |<span style="color: red;">63.47 |<span style="color: red;">85.56 |<span style="color: red;">68.78 |<span style="color: red;">87.41 |<span style="color: red;">70.09 |
+ |Orion 8x7B |69.39 |54.32 |80.6 |63.47 |85.56 |68.78 |87.41 |70.09 |
 
 
  ### 3.1.5. 泄漏检测结果
  检测测试题目的泄露程度,值越大泄露的越严重
  - 检测代码: https://github.com/nishiwen1214/Benchmark-leakage-detection
  - 论文: https://web3.arxiv.org/pdf/2409.01790
- - 博客: https://mp.weixin.qq.com/s/BtcJmDEUyzAYG-fqCal2lA
  - 英文测试:mmlu
  - 中文测试:ceval, cmmlu
 
  |Threshold 0.2 | qwen2.5 32b | qwen1.5 32b | orion 8x7b | orion 14b | mixtral 8x7b |
  |----|----|----|----|----|----|
- |mmlu | 0.3 | 0.27 |<span style="color: red;">0.22 | 0.28 | 0.25 |
- |ceval | 0.39 | 0.38 |<span style="color: red;">0.27 | 0.26 | 0.26 |
- |cmmlu | 0.38 | 0.39 |<span style="color: red;">0.23 | 0.27 | 0.22 |
+ |mmlu | 0.3 | 0.27 | 0.22 | 0.28 | 0.25 |
+ |ceval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
+ |cmmlu | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
 
 
  ### 3.1.6. 推理速度
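The avg columns in the Japanese and Korean tables of both README versions carry long unrounded decimals (e.g. 71.98142857 for Orion 8x7B on the Korean testsets), which is consistent with a plain arithmetic mean over the per-task scores. A quick sanity check in Python against the Orion 8x7B row from section 3.1.3 of the diff above:

```python
# Recompute the Korean-testset "avg" column for the Orion 8x7B row (section 3.1.3).
# Per-task scores: haerae, kobest boolq, kobest copa, kobest hellaswag,
# kobest sentineg, kobest wic, paws_ko (values taken from the table in the diff).
scores = [65.17, 85.4, 80.4, 56, 96.98, 73.57, 46.35]
avg = sum(scores) / len(scores)
print(round(avg, 8))  # 71.98142857, matching the reported avg
```

The same relation appears to hold for the Japanese table, where 0.8293 matches the mean of the six Orion 8x7B task scores.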