metadata
title: Italian Open LLM Leaderboard
emoji: π
colorFrom: red
colorTo: green
sdk: streamlit
sdk_version: 1.34.0
app_file: main.py
pinned: true
license: apache-2.0
Italian LLM-Leaderboard
Italian leaderboard
Leaderboard
Model Name | Year | Publisher | # Params | Lang. | Avg. | Avg. (0-shot) | Avg. (N-shot) | MMLU (0-shot) | MMLU (5-shot) | ARC-C (0-shot) | ARC-C (25-shot) | HellaSwag (0-shot) | HellaSwag (10-shot) | TruthfulQA (0-shot) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DanteLLM | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 47.52 | 47.34 | 47.69 | 47.05 | 48.27 | 41.89 | 47.01 | 47.99 | 47.79 | 52.41 |
OpenDanteLLM | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 45.97 | 45.13 | 46.80 | 44.25 | 46.89 | 41.72 | 46.76 | 46.49 | 46.75 | 48.06 |
Mistral v0.2 | 2023 | Mistral AI | 7B | English | 44.29 | 45.15 | 43.43 | 44.66 | 45.84 | 37.46 | 41.47 | 43.48 | 42.99 | 54.99 |
LLaMAntino | 2024 | Bari University | 7B | Italian FT | 41.66 | 40.86 | 42.46 | 33.89 | 38.74 | 38.22 | 41.72 | 46.30 | 46.91 | 45.03 |
Fauno2 | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 41.74 | 42.90 | 40.57 | 40.30 | 38.32 | 36.26 | 39.33 | 44.25 | 44.07 | 50.77 |
Fauno1 | 2023 | RSTLess (Sapienza University of Rome) | 7B | Italian FT | 36.91 | 37.20 | 36.61 | 28.79 | 30.45 | 33.10 | 36.52 | 43.13 | 42.86 | 43.78 |
Camoscio | 2023 | Gladia (Sapienza University of Rome) | 7B | Italian FT | 37.22 | 38.01 | 36.42 | 30.53 | 29.38 | 33.28 | 36.60 | 42.91 | 43.29 | 45.33 |
LLaMA2 | 2023 | Meta | 7B | English | 39.50 | 39.14 | 39.86 | 34.12 | 37.91 | 33.28 | 37.71 | 44.31 | 43.97 | 44.83 |
BloomZ | 2022 | BigScience | 7B | Multilingual | 33.97 | 36.01 | 31.93 | 36.40 | 31.67 | 27.30 | 28.24 | 34.83 | 35.88 | 45.52 |
iT5 | 2022 | Groningen University | 738M | Italian | 29.27 | 32.42 | 26.11 | 23.69 | 24.31 | 27.39 | 27.99 | 28.11 | 26.04 | 50.49 |
GePpeTto | 2020 | Pisa/Groningen University, FBK, Aptus.AI | 117M | Italian | 27.86 | 30.89 | 24.82 | 22.87 | 24.39 | 24.15 | 25.08 | 26.34 | 24.99 | 50.20 |
mT5 | 2020 | Google | 3.7B | Multilingual | 29.00 | 30.99 | 27.01 | 25.56 | 25.60 | 25.94 | 27.56 | 26.96 | 27.86 | 45.50 |
Minerva 3B | 2024 | SapienzaNLP (Sapienza University of Rome) | 3B | Multilingual | 33.94 | 34.37 | 33.52 | 24.62 | 26.50 | 30.29 | 30.89 | 42.38 | 43.16 | 40.18 |
Minerva 1B | 2024 | SapienzaNLP (Sapienza University of Rome) | 1B | Multilingual | 29.78 | 31.46 | 28.09 | 24.69 | 24.94 | 24.32 | 25.25 | 34.01 | 34.07 | 42.84 |
Minerva 350M | 2024 | SapienzaNLP (Sapienza University of Rome) | 350M | Multilingual | 28.35 | 30.72 | 26.00 | 23.10 | 24.29 | 23.21 | 24.32 | 29.33 | 29.37 | 47.23 |
Modello Italia | 2024 | iGenius | 9B | Italian | 41.22 | 40.89 | 41.67 | 39.76 | 41.01 | 34.81 | 39.16 | 43.97 | 44.85 | 45.01 |
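
The aggregate columns in the table above appear consistent with a simple scheme: Avg. (0-shot) is the mean of the four 0-shot scores, Avg. (N-shot) is the mean of the three few-shot scores, and Avg. is the mean of those two values. This is inferred from the published numbers, not taken from the leaderboard's source code, and small differences in the last digit are expected because the table shows rounded scores. A minimal sketch, where the function name `leaderboard_averages` is hypothetical:

```python
from statistics import mean


def leaderboard_averages(zero_shot, n_shot):
    """Return (avg, avg_zero_shot, avg_n_shot), each rounded to two decimals.

    Assumption: the overall average is the mean of the two sub-averages,
    as the published table suggests; this is not confirmed by the source.
    """
    avg_zero = mean(zero_shot)
    avg_n = mean(n_shot)
    overall = mean([avg_zero, avg_n])
    return round(overall, 2), round(avg_zero, 2), round(avg_n, 2)


# DanteLLM row: MMLU, ARC-C, HellaSwag, TruthfulQA (0-shot),
# then MMLU (5-shot), ARC-C (25-shot), HellaSwag (10-shot).
dante = leaderboard_averages(
    zero_shot=[47.05, 41.89, 47.99, 52.41],
    n_shot=[48.27, 47.01, 47.79],
)
print(dante)
```

For the DanteLLM row the results should land within about 0.01 of the published 47.52, 47.34, and 47.69.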
Benchmarks
Benchmark Name | Author | Link | Description |
---|---|---|---|
ARC Challenge | Clark et al. | https://arxiv.org/abs/1803.05457 | "We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community." |
HellaSwag | Zellers et al. | https://arxiv.org/abs/1905.07830v1 | "HellaSwag is a challenge dataset for evaluating commonsense NLI that is specially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy)." (Source: https://paperswithcode.com/dataset/hellaswag) |
MMLU | Hendrycks et al. | https://github.com/hendrycks/test | "The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model's blind spots." (Source: https://paperswithcode.com/dataset/mmlu) |
TruthfulQA | Lin et al. | https://arxiv.org/abs/2109.07958 | "We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on 58% of questions, while human performance was 94%. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web." |
Authors
- Andrea Bacciu* (work done prior to joining Amazon)
- Cesare Campagnano*
- Giovanni Trappolini
- Prof. Fabrizio Silvestri
* Equal contribution.
Acknowledgements
Special thanks to https://github.com/LudwigStumpp/llm-leaderboard for the initial inspiration and codebase.