---
license: apache-2.0
---
|
|
|
Zero-shot results using [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) as the teacher model and [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) to initialize the distilled model.
|
|
|
| Task          | Llama-3.2-3B-Instruct | Llama3.2-Mamba-3B-distill |
|---------------|-----------------------|---------------------------|
| arc_challenge | 0.459                 | 0.4838                    |
| arc_easy      | 0.7407                | 0.7765                    |
| hellaswag     | 0.7043                | 0.7037                    |
| mmlu          | 0.6043                | 0.5448                    |
| openbookqa    | 0.36                  | 0.394                     |
| piqa          | 0.7568                | 0.7731                    |
| pubmedqa      | 0.696                 | 0.664                     |
| race          | 0.4067                | 0.4029                    |
| winogrande    | 0.6748                | 0.6732                    |
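
Scores like those above are typically produced with EleutherAI's lm-evaluation-harness. Below is a minimal sketch of how such a zero-shot run could look; the repo id `JunxiongWang/Llama3.2-Mamba-3B-distill` and the use of the harness's stock `hf` backend are assumptions, and the hybrid Mamba checkpoint may instead require the custom loading code from the MambaInLlama repository.

```python
# Sketch: zero-shot evaluation with lm-evaluation-harness (v0.4+).
# Assumption: the checkpoint loads through the standard `hf` backend;
# the hybrid model may need the MambaInLlama repo's own loader instead.
import json

import lm_eval  # pip install lm-eval

TASKS = [
    "arc_challenge", "arc_easy", "hellaswag", "mmlu", "openbookqa",
    "piqa", "pubmedqa", "race", "winogrande",
]

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical repo id; substitute the actual checkpoint path.
    model_args="pretrained=JunxiongWang/Llama3.2-Mamba-3B-distill,dtype=bfloat16",
    tasks=TASKS,
    num_fewshot=0,  # zero-shot, matching the table above
    batch_size=8,
)

# Per-task metrics (accuracy, normalized accuracy, etc.)
print(json.dumps(results["results"], indent=2, default=str))
```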
|
|
|
|
|
```
@article{junxiongdaniele2024mambainllama,
  title   = {The Mamba in the Llama: Distilling and Accelerating Hybrid Models},
  author  = {Junxiong Wang and Daniele Paliotta and Avner May and Alexander M. Rush and Tri Dao},
  journal = {arXiv preprint arXiv:2408.15237},
  year    = {2024}
}
```