---
license: llama3.2
datasets:
- BAAI/Infinity-Instruct
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

## Model Overview

This model is a fine-tuned version of **[Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** trained with the **[LLM-Neo](https://arxiv.org/abs/2411.06839)** method. Usage is identical to the original Llama-3.2-1B-Instruct model.
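
The snippet below is a minimal usage sketch with Hugging Face Transformers; the repository ID is a placeholder and should be replaced with this model's actual ID.

```python
# Minimal usage sketch with Hugging Face Transformers.
# NOTE: the repository ID below is a placeholder; replace it with this model's actual ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-namespace/Llama-3.2-1B-Instruct-Neo"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Same chat template as the original Llama-3.2-1B-Instruct.
messages = [{"role": "user", "content": "Explain knowledge distillation in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```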

## Training Details

The training process employs the **LLM-Neo** method. The training data is a mixture sampled from **[BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)**, specifically the `0625` and `7M` subsets, totaling 10k instruction samples. The teacher model for knowledge distillation (KD) is **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)**. The hyperparameters are listed below, followed by a sketch of how the KD ratio enters the training objective:

- **Learning Rate**: 1e-4  
- **Epochs**: 1  
- **KD Ratio**: 0.9
- **Rank**: 128
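
The sketch below shows one common way a KD ratio blends a distillation term with a standard cross-entropy term; it is illustrative only, and the exact LLM-Neo objective (including how the LoRA rank of 128 is applied) is defined in the paper.

```python
# Hedged sketch: one common way a KD ratio of 0.9 could blend distillation and SFT losses.
# This is illustrative only; the exact LLM-Neo objective is defined in arXiv:2411.06839.
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, kd_ratio=0.9, temperature=1.0):
    # SFT term: next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # KD term: KL divergence from the teacher's token distribution to the student's.
    # Teacher (Llama-3.1-8B-Instruct) and student (Llama-3.2-1B) share a vocabulary,
    # so their logits are directly comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # kd_ratio = 0.9 weights the distillation term heavily, matching the setting above.
    return kd_ratio * kd + (1.0 - kd_ratio) * ce
```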

## Model Performance Evaluation

<img src="https://raw.githubusercontent.com/Rummyyang/Rummyyang.github.io/refs/heads/main/img/radar_chart_neo_llama3.2_larger_text-1120-1-1.png" alt="Neo_radar" width="600">


The evaluation is divided into two parts: results from the **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** framework and results from the **[math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness)** framework.

> **Note**: The results are influenced by the specific benchmark versions and testing hardware/software configurations.
> Therefore, the reported metrics should be interpreted as relative performance within a given setup.
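
For reference, comparisons of this kind can be run with the lm-evaluation-harness Python API. The sketch below is illustrative; the task list, batch size, and model ID are assumptions and may not match the exact configuration used for the numbers reported here.

```python
# Illustrative lm-evaluation-harness run; tasks, batch size, and model ID are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
# Per-task metrics (e.g., accuracy) are collected under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```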

### Part 1: lm-evaluation-harness results

In this part, the model was evaluated on several widely used benchmarks covering reasoning, commonsense, mathematics, and language understanding. Below is a comparison of **Llama-3.2-1B-Instruct** and the fine-tuned model:

| Dataset             | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------|
| ARC Challenge       | 36.09                 | 36.43         |
| ARC Easy            | 68.52                 | 67.51         |
| CEval               | 39.45                 | 39.67         |
| CMMLU               | 35.62                 | 36.48         |
| MMLU                | 45.91                 | 46.27         |
| HellaSwag           | 45.07                 | 45.84         |
| OpenBookQA          | 24.40                 | 25.40         |
| PIQA                | 73.88                 | 74.32         |
| Winogrande          | 59.27                 | 61.17         |

The fine-tuned model outperforms **Llama-3.2-1B-Instruct** on 8 of the 9 tasks, with the largest gains on **Winogrande** (+1.90) and **OpenBookQA** (+1.00); **ARC Easy** is the only regression (−1.01).

---

### Part 2: math-evaluation-harness results

In this part, the model was evaluated on mathematical reasoning benchmarks, focusing on its ability to solve math word problems.

| Dataset             | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------|
| GSM8K               | 35.00                 | 39.30         |
| Minerva Math        | 14.80                 | 22.80         |
| SVAMP               | 50.40                 | 54.50         |
| ASDiv               | 67.40                 | 71.20         |
| MAWPS               | 83.50                 | 85.60         |
| TabMWP              | 41.90                 | 35.40         |
| MathQA              | 44.20                 | 48.30         |
| MMLU-STEM           | 37.90                 | 38.90         |

The mathematical evaluation shows gains on 7 of the 8 benchmarks, with the largest improvements on **Minerva Math** (+8.00) and **GSM8K** (+4.30); **TabMWP** is the only regression (−6.50).

---



### Summary

- **Strengths**: The fine-tuned model improves over **Llama-3.2-1B-Instruct** on most benchmarks, particularly in commonsense reasoning and mathematical problem-solving.
- **Future Directions**: Address the regression on tabular math word problems (**TabMWP**) and continue improving general language understanding and mathematical robustness.