Include general knowledge benchmarks #4

Files changed (1)
  1. README.md +31 -4
README.md CHANGED
@@ -235,12 +235,11 @@ While performing in the same ballpark as `llama-3.1-8b-instruct`, `Pharia-1-LLM-
  | | | | | |
  | --- | --- | --- | --- | --- |
  | **Model** | **Quality DE**, 1 (bad) to 5 (great) | **Quality EN**, 1 (bad) to 5 (great) | **Concise**, in % | **Instruction following**, in % |
- | `llama-3.1-8b-instruct` | **3.62** | **4.01** | 89.7 | **83.6** |
+ | `llama-3.1-8b-instruct` | **3.62** | 4.01 | 89.7 | **83.6** |
  | `Pharia-1-LLM-7B-control` | 3.60 | 4.00 | **91.9** | 81.8 |
+ | `Pharia-1-LLM-7B-control-aligned` | 3.51 | **4.08** | 81.8 | 77.7 |
  | `Mistral-7B-Instruct-v0.3` | 3.47 | 3.88 | 88.5 | 80.4 |

- **Note:** We will add the engineering benchmark evaluations for `Pharia-1-LLM-7B-control-aligned` shortly.
-
  #### Performance on length-controlled completions

  “Absolute normalized distance to target” measures how much a model’s completions deviate from the desired length, calculated as:
@@ -278,7 +277,35 @@ We assessed each model’s ability to produce safe answers given prompts that te

  ### General Knowledge Benchmarks

- We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) provide a reproducible way to compare model performance, they were designed for evaluating pre-trained models and should not be mistaken for strong indicators of use-case-specific performance. In contrast to what [some research](https://arxiv.org/abs/2405.00332) might suggest for other models, our Pharia-1-LLM-7B models have not been tailored to such generic benchmarks and would naturally be expected to underperform on them. We will continue to evaluate transparently against these generic benchmarks as well and share the results here shortly.
+ We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) provide a reproducible way to compare model performance, they were designed for evaluating pre-trained models and should not be mistaken for strong indicators of use-case-specific performance. In contrast to what [some research](https://arxiv.org/abs/2405.00332) might suggest for other models, our Pharia-1-LLM-7B models have not been tailored to such generic benchmarks and would naturally be expected to underperform on them.
+
+ | **Benchmark** | **Shots** | **Metric** | **Pharia-1-LLM-7B-control** | **Pharia-1-LLM-7B-control-aligned** | **Llama-3.1-8B-Instruct** | **Mistral-7B-Instruct-v0.3** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | 1. **General Knowledge:** [**Open LLM Leaderboard V1**](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) | | | | | | |
+ | ARC-Challenge | 25 | **acc\_norm** | `0.546` | `0.528` | `0.563` | `0.613` |
+ | TruthfulQA | 6 | **prob\_mass** | `0.547` | `0.566` | `0.542` | `0.635` |
+ | GSM8K | 5 | **acc** | `0.014` | `0.163` | `0.573` | `0.488` |
+ | MMLU | 5 | **acc** | `0.484` | `0.525` | `0.659` | `0.624` |
+ | HellaSwag | 10 | **acc\_norm** | `0.646` | `0.761` | `0.779` | `0.826` |
+ | Winogrande | 5 | **acc** | `0.651` | `0.643` | `0.732` | `0.784` |
+ | 2. **General Knowledge: Multilingual** | | | | | | |
+ | Lambada Multilingual: en, fr, de, it, es | 10 | **acc** | `0.340` | `0.525` | `0.540` | `0.589` |
+ | ARC-Challenge-DE | 25 | **acc\_norm** | `0.486` | `0.486` | `0.459` | `0.475` |
+ | HellaSwag-DE | 10 | **acc\_norm** | `0.487` | `0.633` | `0.598` | `0.583` |
+ | MMLU-DE | 5 | **acc** | `0.428` | `0.488` | `0.589` | `0.537` |
+ | TruthfulQA-DE | 6 | **prob\_mass** | `0.561` | `0.576` | `0.509` | `0.623` |
+ | 3. **Translation** | | | | | | |
+ | WMT14 | 5 | **bleu, chrf, ter** | `32.66`, `61.32`, `53.77` | `33.07`, `61.73`, `53.14` | `35.77`, `63.08`, `50.02` | `33.29`, `61.49`, `52.56` |
+ | WMT16 | 5 | **bleu, chrf, ter** | `30.59`, `60.36`, `56.62` | `31.64`, `61.18`, `55.48` | `34.24`, `62.69`, `51.95` | `31.13`, `60.34`, `56.25` |
+ | WMT20 | 5 | **bleu, chrf, ter** | `26.60`, `58.57`, `63.09` | `26.65`, `58.82`, `63.37` | `28.12`, `59.60`, `59.73` | `26.32`, `58.06`, `61.81` |
+ | 4. **Expert Domain: Law** | | | | | | |
+ | Legal-Sentence-Classification-Dataset | 5 | **acc** | `0.315` | `0.357` | `0.424` | `0.418` |
+ | LexGlue Case-Hold | 5 | **acc\_norm** | `0.268` | `0.282` | `0.297` | `0.303` |
+ | MMLU Law | 5 | **acc** | `0.465` | `0.524` | `0.689` | `0.674` |
+ | MMLU-DE Law | 5 | **acc** | `0.439` | `0.516` | `0.626` | `0.560` |
+ | 5. **Expert Domain: Engineering** | | | | | | |
+ | MMLU Engineering | 5 | **acc** | `0.401` | `0.431` | `0.624` | `0.595` |
+ | MMLU-DE Engineering | 5 | **acc** | `0.389` | `0.426` | `0.529` | `0.533` |

  # Training Details

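For context on the accuracy-style rows in the added table (the **Shots** column and the `acc`/`acc_norm` metrics): the card does not state which evaluation framework produced these numbers, but few-shot scores of this kind are commonly computed with EleutherAI's lm-evaluation-harness. The sketch below is a minimal, hypothetical reproduction assuming that harness, the Hugging Face repo id `Aleph-Alpha/Pharia-1-LLM-7B-control` (assumed; check the Hub for the exact id), and the shot counts from the table; it is not the card's actual evaluation pipeline.

```python
# Hypothetical reproduction sketch -- the card does not name the eval framework.
# Assumes EleutherAI's lm-evaluation-harness (pip install lm-eval) and the
# Hugging Face repo id Aleph-Alpha/Pharia-1-LLM-7B-control (assumed).
from lm_eval import simple_evaluate

# Shot counts as reported in the benchmark table; harness task names are the
# closest standard equivalents (e.g. truthfulqa_mc2 for the prob_mass column).
TASK_SHOTS = {
    "arc_challenge": 25,
    "truthfulqa_mc2": 6,
    "gsm8k": 5,
    "mmlu": 5,
    "hellaswag": 10,
    "winogrande": 5,
}

for task, shots in TASK_SHOTS.items():
    out = simple_evaluate(
        model="hf",
        # trust_remote_code may be required for the Pharia architecture
        model_args="pretrained=Aleph-Alpha/Pharia-1-LLM-7B-control,"
                   "dtype=bfloat16,trust_remote_code=True",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, out["results"])  # per-task acc / acc_norm values
```

Exact figures also depend on harness version, prompt formatting, and normalization choices, so a rerun may not match the published numbers to the last digit.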
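Similarly, the WMT translation rows report corpus-level **bleu, chrf, ter** scores. The toy snippet below only illustrates how such scores are computed with the `sacrebleu` package; the sentences are placeholders, not WMT data, and this is not necessarily the pipeline behind the card's numbers.

```python
# Toy illustration of the three translation metrics in the table (BLEU, chrF, TER)
# using sacrebleu; the sentences below are placeholders, not WMT test sets.
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He went to the market yesterday."]
# One reference stream, parallel to the hypotheses.
references = [["The cat is sitting on the mat.", "He went to the market yesterday."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)

print(f"bleu={bleu.score:.2f}  chrf={chrf.score:.2f}  ter={ter.score:.2f}")
```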