Runming Yang commited on
Commit
c347fbc
·
1 Parent(s): e015a93

Upload model files

Browse files
README.md CHANGED
@@ -1,3 +1,78 @@
1
- ---
2
- license: llama3.2
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: llama3.2
3
+ datasets:
4
+ - BAAI/Infinity-Instruct
5
+ base_model:
6
+ - meta-llama/Llama-3.2-1B-Instruct
7
+ ---
8
+
9
+ ## Model Overview
10
+
11
+ This weight is a fine-tuned version of **[Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** using the **[LLM-Neo](https://arxiv.org/abs/2411.06839)** method. Usage is identical to the original Llama-3.2-1B-Instruct model.
12
+
13
+ ## Training Details
14
+
15
+ The training process employs the **LLM-Neo** method. The dataset is derived from a mixed sample of **[BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)**, specifically the `0625` and `7M` subsets, with a total of 10k instruction samples. The KD (knowledge distillation) model used is **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)**, with the following hyperparameters:
16
+
17
+ - **Learning Rate**: 1e-4
18
+ - **Epochs**: 1
19
+ - **KD Ratio**: 0.9
20
+ - **Rank**: 128
21
+
22
+ ## Model Performance Evaluation
23
+
24
+ <img src="https://raw.githubusercontent.com/Rummyyang/Rummyyang.github.io/refs/heads/main/img/radar_chart_neo_llama3.2_larger_text-1120-1-1.png" alt="Neo_radar" width="600">
25
+
26
+ <!-- ![Neo_radar](https://raw.githubusercontent.com/Rummyyang/Rummyyang.github.io/refs/heads/main/img/radar_chart_neo_llama3.2_larger_text-1120-1-1.png) -->
27
+
28
+ The evaluation of this model is divided into two parts: results from **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** and **[math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness)** frameworks.
29
+
30
+ > **Note**: The results are influenced by the specific benchmark versions and testing hardware/software configurations.
31
+ > Therefore, the reported metrics should be interpreted as relative performance within a given setup.
32
+
33
+ ### Part 1: lm-evaluation-harness results
34
+
35
+ In this part, the model was evaluated on several widely-used benchmark datasets, covering reasoning, commonsense, mathematics, and language understanding tasks. Below is a detailed comparison of the performance metrics between **Llama-3.2-1B-Instruct** and the current model:
36
+
37
+ | Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
38
+ |---------------------|------------------------|---------------|
39
+ | ARC Challenge | 36.09 | 36.43 |
40
+ | ARC Easy | 68.52 | 67.51 |
41
+ | CEval | 39.45 | 39.67 |
42
+ | CMMLU | 35.62 | 36.48 |
43
+ | MMLU | 45.91 | 46.27 |
44
+ | HellaSwag | 45.07 | 45.84 |
45
+ | OpenBookQA | 24.40 | 25.40 |
46
+ | PIQA | 73.88 | 74.32 |
47
+ | Winogrande | 59.27 | 61.17 |
48
+
49
+ The results demonstrate that the current model outperforms **Llama-3.2-1B-Instruct** in several tasks, especially in reasoning tasks (e.g., **Winogrande**) and commonsense tasks (e.g., **PIQA**).
50
+
51
+ ---
52
+
53
+ ### Part 2: math-evaluation-harness results
54
+
55
+ In this part, the model was evaluated specifically on mathematical reasoning and related tasks, focusing on its ability to handle complex mathematical problems.
56
+
57
+ | Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
58
+ |---------------------|------------------------|---------------|
59
+ | GSM8K | 35.00 | 39.30 |
60
+ | Minerva Math | 14.80 | 22.80 |
61
+ | SVAMP | 50.40 | 54.50 |
62
+ | ASDiv | 67.40 | 71.20 |
63
+ | MAWPS | 83.50 | 85.60 |
64
+ | TabMWP | 41.90 | 35.40 |
65
+ | MathQ | 44.20 | 48.30 |
66
+ | MMLU-STEM | 37.90 | 38.90 |
67
+
68
+ The mathematical evaluation highlights significant improvements of the current model in handling complex problems, with notable progress on datasets such as **Minerva Math** and **GSM8K**.
69
+
70
+ ---
71
+
72
+
73
+
74
+ ### Summary
75
+
76
+ - **Strengths**: The current model demonstrates notable improvements over **Llama-3.2-1B-Instruct** across multiple benchmark tasks, particularly in reasoning and mathematical problem-solving.
77
+ - **Future Directions**: Further optimization in logical reasoning tasks (e.g., **TabMWP**) and continued enhancements in general language and mathematical adaptability.
78
+
config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:63a5518e13fcb8c3d6874e4384fdcaccdf13f93691e4e11efa111ca66984eb0a
3
+ size 872
generation_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:46650402223e517e09ac32797ba8cff47cf4cfea248aed800a76a0c50ba4e92d
3
+ size 184
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5516da399f3d61e832ef8ec884018c78cf10de495b072161c19387ffd20efc78
3
+ size 1997648472
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:15726f8ace29adffc550ea0fdccdb96bc5a53b8270aaf01d9762149865dacb50
3
+ size 473997096
model.safetensors.index.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2807004134bcb4f726e56cd6a9a60f499e9a047116729f3572397d968970623
3
+ size 12003
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b1835caa5b4d70acaa210fa222b0036f1882f9525c4660fd4810fb3e1e40ff8
3
+ size 325
tokenizer.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
3
+ size 17209920
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:50536ab56a629f13c0227a1658c5c040cde997239f94dc6d9df3db2128e5ade0
3
+ size 54616