Losin94 committed on
Commit 3b832a9
1 Parent(s): ce45c39

Update README.md

Files changed (1)
  1. README.md +36 -0
README.md CHANGED
@@ -33,6 +33,42 @@ KeyError: 'qwen2'

 We do not advise you to use base language models for text generation. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., on this model.

+ ## Performance
+
+
+ The evaluation of base models mainly focuses on performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.
+
+ The datasets for evaluation include:
+
+ **English Tasks**: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)
+
+ **Coding Tasks**: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)
+
+ **Math Tasks**: GSM8K (4-shot), MATH (4-shot)
+
+ **Chinese Tasks**: C-Eval (5-shot), CMMLU (5-shot)
+
+ **Multilingual Tasks**: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)
+
+ #### Qwen2-0.5B & Qwen2-1.5B performance
+ | Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
+ | :-------- | :---------: | :------------: | :------------: | :------------: | :------------: | :------------: |
+ | #Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
+ | MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | **56.5** |
+ | MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
+ | Theorem QA | - | - | - | - | 8.9 | **15.0** |
+ | HumanEval | 47.6 | 22.0 | **50.0** | 20.1 | 22.0 | 31.1 |
+ | MBPP | **55.0** | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
+ | GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | **58.5** |
+ | MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | **21.7** |
+ | BBH | **43.4** | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
+ | HellaSwag | **73.1** | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
+ | Winogrande | **74.4** | 66.8 | - | 60.3 | 56.8 | 66.2 |
+ | ARC-C | **61.1** | 48.5 | - | 37.9 | 31.5 | 43.9 |
+ | TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | **45.9** |
+ | C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | **70.6** |
+ | CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | **70.3** |
+
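
The scores in the table above are few-shot accuracies, but the commit does not state which evaluation framework produced them. As a rough, non-authoritative sketch, the snippet below shows how a comparable 5-shot MMLU run could be set up with EleutherAI's lm-evaluation-harness; the harness choice, dtype, and batch size here are assumptions, not part of this model card.

```python
# Reproduction sketch only, not the authors' documented setup: the commit does not
# state which framework produced the table above. This assumes EleutherAI's
# lm-evaluation-harness (`pip install lm-eval`) and a GPU with enough memory.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",
    tasks=["mmlu"],                                      # the table reports MMLU as 5-shot
    num_fewshot=5,
    batch_size=8,                                        # assumed; tune to available memory
)
print(results["results"]["mmlu"])                        # aggregate MMLU accuracy
```

Scores obtained this way will not necessarily match the table, since prompt formatting and answer extraction differ across evaluation frameworks.
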

 ## Citation
