---
license: apache-2.0
datasets:
  - OpenRLHF/prompt-collection-v0.1
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
library_name: transformers
---
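
The model loads through the standard transformers causal-LM API (per `library_name` above). Below is a minimal usage sketch; the repo id is a placeholder, since this card does not state it:

```python
# Minimal loading sketch via transformers, assuming the standard
# causal-LM interface of the Llama-3.2-1B-Instruct base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder -- substitute this repository's actual id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Llama-3.2-Instruct models expect the chat template.
messages = [{"role": "user", "content": "What is 17 * 23?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```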

## This model's benchmark results

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.4253 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.3768 | ± N/A |
| | | strict-match | 5 | exact_match | 0.3768 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.5379 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.4483 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4217 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.5366 | ± N/A |

## Original meta-llama/Llama-3.2-1B-Instruct benchmark results

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| tinyBenchmarks | N/A | | | | | |
| - tinyArc | 0 | none | 25 | acc_norm | 0.4145 | ± N/A |
| - tinyGSM8k | 0 | flexible-extract | 5 | exact_match | 0.3412 | ± N/A |
| | | strict-match | 5 | exact_match | 0.3412 | ± N/A |
| - tinyHellaswag | 0 | none | 10 | acc_norm | 0.5335 | ± N/A |
| - tinyMMLU | 0 | none | 0 | acc_norm | 0.4298 | ± N/A |
| - tinyTruthfulQA | 0 | none | 0 | acc | 0.4288 | ± N/A |
| - tinyWinogrande | 0 | none | 5 | acc_norm | 0.5366 | ± N/A |
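
Both tables follow the output format of EleutherAI's lm-evaluation-harness. A reproduction sketch using the harness's Python API and its tinyBenchmarks task group; the exact invocation used for this card is not documented, so treat these arguments as assumptions:

```python
# Reproduction sketch with EleutherAI's lm-evaluation-harness
# (pip install lm-eval). The settings below are assumptions, not
# the confirmed configuration behind the tables above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",  # or this repo's id
    tasks=["tinyBenchmarks"],  # group covering tinyArc, tinyGSM8k, etc.
)
print(results["results"])
```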

Below is a side-by-side comparison of the two result sets. For each task, the higher value is shown in bold:

| Task | This model | Original | Better |
|---|---|---|---|
| tinyArc (acc_norm) | **0.4253** | 0.4145 | this model |
| tinyGSM8k (exact_match) | **0.3768** | 0.3412 | this model |
| tinyHellaswag (acc_norm) | **0.5379** | 0.5335 | this model |
| tinyMMLU (acc_norm) | **0.4483** | 0.4298 | this model |
| tinyTruthfulQA (acc) | 0.4217 | **0.4288** | original |
| tinyWinogrande (acc_norm) | 0.5366 | 0.5366 | tie |

## Observations

1. This model outperforms the original on four tasks (tinyArc, tinyGSM8k, tinyHellaswag, tinyMMLU).
2. The original outperforms this model on one task (tinyTruthfulQA).
3. The two models tie on one task (tinyWinogrande).

Given these comparisons, this model's results are stronger overall: it scores higher on four of the six tasks. The exceptions are tinyTruthfulQA, where the original scores slightly higher (0.4288 vs. 0.4217), and tinyWinogrande, where the two models tie.
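
For a sense of the margins, the per-task deltas can be computed directly from the two tables (illustrative sketch; values copied from above):

```python
# Per-task deltas (this model minus original), from the tables above.
this_model = {"tinyArc": 0.4253, "tinyGSM8k": 0.3768, "tinyHellaswag": 0.5379,
              "tinyMMLU": 0.4483, "tinyTruthfulQA": 0.4217, "tinyWinogrande": 0.5366}
original = {"tinyArc": 0.4145, "tinyGSM8k": 0.3412, "tinyHellaswag": 0.5335,
            "tinyMMLU": 0.4298, "tinyTruthfulQA": 0.4288, "tinyWinogrande": 0.5366}

for task in this_model:
    print(f"{task:16s} {this_model[task] - original[task]:+.4f}")
```

The largest gain is on tinyGSM8k (+0.0356); the single regression, on tinyTruthfulQA, is smaller in magnitude (-0.0071).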