Update README.md
README.md
@@ -33,6 +33,42 @@ KeyError: 'qwen2'
We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.
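As a rough illustration of the post-training step, the snippet below runs a single supervised fine-tuning (SFT) update on the base model with plain PyTorch and `transformers`. It is a minimal sketch: the model id `Qwen/Qwen2-0.5B`, the toy example, and the learning rate are assumptions for illustration, and a real SFT run would use a curated dataset and a full training loop or trainer.

```python
# Minimal SFT-style sketch; model id, data, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B"  # assumed base-model repo id
# Requires a transformers release that recognizes the "qwen2" architecture
# (the KeyError: 'qwen2' shown above is the symptom of an outdated install).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()

# One toy instruction/response pair; a real SFT run iterates over a curated dataset.
text = "Question: What is the capital of France?\nAnswer: Paris."
batch = tokenizer(text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
out = model(**batch, labels=batch["input_ids"])  # causal-LM loss over the sequence
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"toy SFT step loss: {out.loss.item():.3f}")
```

RLHF and continued pretraining follow the same pattern of further updating these weights, only with different objectives and data.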
## Performance
The evaluation of base models mainly focuses on natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.
The datasets for evaluation include (see the few-shot prompting sketch after this list):
**English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
**Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)
**Math Tasks**: GSM8K (4-shot), MATH (4-shot)
**Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)
**Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
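To make the shot counts concrete, here is a minimal sketch of few-shot prompting: n solved demonstrations are prepended to the test question and the model's continuation is scored against the reference. The demonstrations, prompt template, and model id below are illustrative assumptions, not the exact evaluation harness behind the numbers in the next table.

```python
# Sketch of n-shot prompting; the task, template, and model id are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B"  # assumed base-model repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# "5-shot" means five solved examples precede the test question in one prompt.
shots = [
    "Question: 2 + 2 = ?\nAnswer: 4",
    "Question: 7 - 3 = ?\nAnswer: 4",
    "Question: 5 * 6 = ?\nAnswer: 30",
    "Question: 9 / 3 = ?\nAnswer: 3",
    "Question: 10 + 5 = ?\nAnswer: 15",
]
test_item = "Question: 8 * 7 = ?\nAnswer:"
prompt = "\n\n".join(shots + [test_item])

inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Keep only the continuation; it is compared against the reference answer for scoring.
completion = tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion.strip())
```

Templated few-shot scoring of this kind is how base models are typically evaluated on the benchmarks above, even though they are not recommended for open-ended text generation.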
#### Qwen2-0.5B & Qwen2-1.5B performances
| Datasets        | Phi-2    | Gemma-2B | MiniCPM  | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
| :-------------- | :------: | :------: | :------: | :----------: | :--------: | :--------: |
| #Non-Emb Params | 2.5B     | 2.0B     | 2.4B     | 1.3B         | 0.35B      | 1.3B       |
| MMLU            | 52.7     | 42.3     | 53.5     | 46.8         | 45.4       | **56.5**   |
| MMLU-Pro        | -        | 15.9     | -        | -            | 14.7       | 21.8       |
| Theorem QA      | -        | -        | -        | -            | 8.9        | **15.0**   |
| HumanEval       | 47.6     | 22.0     | **50.0** | 20.1         | 22.0       | 31.1       |
| MBPP            | **55.0** | 29.2     | 47.3     | 18.0         | 22.0       | 37.4       |
| GSM8K           | 57.2     | 17.7     | 53.8     | 38.4         | 36.5       | **58.5**   |
| MATH            | 3.5      | 11.8     | 10.2     | 10.1         | 10.7       | **21.7**   |
| BBH             | **43.4** | 35.2     | 36.9     | 24.2         | 28.4       | 37.2       |
| HellaSwag       | **73.1** | 71.4     | 68.3     | 61.4         | 49.3       | 66.6       |
| Winogrande      | **74.4** | 66.8     | -        | 60.3         | 56.8       | 66.2       |
| ARC-C           | **61.1** | 48.5     | -        | 37.9         | 31.5       | 43.9       |
| TruthfulQA      | 44.5     | 33.1     | -        | 39.4         | 39.7       | **45.9**   |
| C-Eval          | 23.4     | 28.0     | 51.1     | 59.7         | 58.2       | **70.6**   |
| CMMLU           | 24.2     | -        | 51.1     | 57.8         | 55.1       | **70.3**   |
## Citation