---
license: llama3.2
datasets:
- BAAI/Infinity-Instruct
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
## Model Overview
This model is a fine-tuned version of **[Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** trained with the **[LLM-Neo](https://arxiv.org/abs/2411.06839)** method. Usage is identical to that of the original Llama-3.2-1B-Instruct model.
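For example, a minimal usage sketch with the 🤗 Transformers `pipeline` API (the repository id below is a placeholder for this model's Hub id; the dtype and device settings are illustrative):

```python
# Minimal usage sketch; the repo id is a placeholder for this model's Hub id.
import torch
from transformers import pipeline

model_id = "<this-repo>/Llama-3.2-1B-Instruct-Neo"  # placeholder

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain knowledge distillation in one sentence."},
]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # assistant reply
```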
## Training Details
The training process uses the **LLM-Neo** method. The training data is a mixed sample drawn from **[BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)**, specifically the `0625` and `7M` subsets, for a total of 10k instruction samples. The teacher model for knowledge distillation (KD) is **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)**, and the following hyperparameters were used (a sketch of the combined training objective follows the list):
- **Learning Rate**: 1e-4
- **Epochs**: 1
- **KD Ratio**: 0.9
- **LoRA Rank**: 128
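LLM-Neo combines LoRA-based fine-tuning with knowledge distillation, so the KD ratio above weights the distillation term against the standard cross-entropy term, and the rank is the LoRA rank. The snippet below is an illustrative sketch of such a combined objective, not the authors' training code; the temperature and the LoRA alpha/target modules are assumptions.

```python
# Illustrative sketch of an LLM-Neo-style objective (not the authors' code):
# loss = kd_ratio * KL(student || teacher) + (1 - kd_ratio) * cross-entropy,
# optimized through a LoRA adapter of rank 128.
import torch
import torch.nn.functional as F

KD_RATIO = 0.9     # "KD Ratio" from the list above
TEMPERATURE = 1.0  # assumed; not stated in this card

def neo_style_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
    """Blend hard-label cross-entropy with KL distillation against the teacher."""
    # Standard next-token cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padded / prompt positions
    )
    # KL divergence between student and teacher token distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * (TEMPERATURE ** 2)
    return KD_RATIO * kd + (1.0 - KD_RATIO) * ce

# LoRA adapter with the listed rank (alpha / target modules are assumptions):
# from peft import LoraConfig
# lora_config = LoraConfig(r=128, lora_alpha=256, target_modules=["q_proj", "v_proj"])
```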
## Model Performance Evaluation
<img src="https://raw.githubusercontent.com/Rummyyang/Rummyyang.github.io/refs/heads/main/img/radar_chart_neo_llama3.2_larger_text-1120-1-1.png" alt="Neo_radar" width="600">
The evaluation of this model is divided into two parts: results from **[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)** and **[math-evaluation-harness](https://github.com/ZubinGou/math-evaluation-harness)** frameworks.
> **Note**: The results are influenced by the specific benchmark versions and testing hardware/software configurations.
> Therefore, the reported metrics should be interpreted as relative performance within a given setup.
### Part 1: lm-evaluation-harness results
In this part, the model was evaluated on several widely used benchmarks covering reasoning, commonsense, mathematics, and language understanding. Below is a detailed comparison between **Llama-3.2-1B-Instruct** and the current model:
| Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------|
| ARC Challenge | 36.09 | 36.43 |
| ARC Easy | 68.52 | 67.51 |
| CEval | 39.45 | 39.67 |
| CMMLU | 35.62 | 36.48 |
| MMLU | 45.91 | 46.27 |
| HellaSwag | 45.07 | 45.84 |
| OpenBookQA | 24.40 | 25.40 |
| PIQA | 73.88 | 74.32 |
| Winogrande | 59.27 | 61.17 |
The results show that the current model outperforms **Llama-3.2-1B-Instruct** on eight of the nine tasks, with the largest gains on commonsense-reasoning benchmarks such as **Winogrande** (+1.90) and **OpenBookQA** (+1.00); the only regression is on **ARC Easy**.
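For reference, a run roughly equivalent to the table above can be launched through the harness's Python API. This is a hedged sketch: the repository id is a placeholder, and task names and metric keys may differ slightly between harness versions.

```python
# Hypothetical reproduction sketch for Part 1 using lm-evaluation-harness.
# The model id is a placeholder; task names vary between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<this-repo>/Llama-3.2-1B-Instruct-Neo,dtype=bfloat16",
    tasks=["arc_challenge", "arc_easy", "ceval-valid", "cmmlu", "mmlu",
           "hellaswag", "openbookqa", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])  # per-task metrics (e.g., acc / acc_norm)
```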
---
### Part 2: math-evaluation-harness results
In this part, the model was evaluated on mathematical reasoning benchmarks, focusing on its ability to handle complex, multi-step math problems.
| Dataset | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct-Neo |
|---------------------|------------------------|---------------|
| GSM8K | 35.00 | 39.30 |
| Minerva Math | 14.80 | 22.80 |
| SVAMP | 50.40 | 54.50 |
| ASDiv | 67.40 | 71.20 |
| MAWPS | 83.50 | 85.60 |
| TabMWP | 41.90 | 35.40 |
| MathQA | 44.20 | 48.30 |
| MMLU-STEM | 37.90 | 38.90 |
The mathematical evaluation shows clear gains for the current model on most datasets, most notably **Minerva Math** (+8.00) and **GSM8K** (+4.30); the one regression is on **TabMWP** (-6.50).
---
### Summary
- **Strengths**: The current model demonstrates notable improvements over **Llama-3.2-1B-Instruct** across multiple benchmark tasks, particularly in reasoning and mathematical problem-solving.
- **Future Directions**: Further optimization on tabular and logical reasoning tasks (e.g., **TabMWP**, where performance regressed) and continued improvement of general language and mathematical capability.