Include general knowledge benchmarks #4

Files changed (1)
  1. README.md +31 -4
README.md CHANGED
@@ -235,12 +235,11 @@ While performing in the same ballpark as `llama-3.1-8b-instruct`, `Pharia-1-LLM-
  | | | | | |
  | --- | --- | --- | --- | --- |
  | **Model** | **Quality DE**, 1 (bad) to 5 (great) | **Quality EN**, 1 (bad) to 5 (great) | **Concise**, in % | **Instruction following**, in % |
- | `llama-3.1-8b-instruct` | **3.62** | **4.01** | 89.7 | **83.6** |
+ | `llama-3.1-8b-instruct` | **3.62** | 4.01 | 89.7 | **83.6** |
  | `Pharia-1-LLM-7B-control` | 3.60 | 4.00 | **91.9** | 81.8 |
+ | `Pharia-1-LLM-7B-control-aligned` | 3.51 | **4.08** | 81.8 | 77.7 |
  | `Mistral-7B-Instruct-v0.3` | 3.47 | 3.88 | 88.5 | 80.4 |

- **Note:** We will add the engineering benchmark evaluations for `Pharia-1-LLM-7B-control-aligned` shortly.
-
  #### Performance on length-controlled completions

  “Absolute normalized distance to target” measures how much a model’s completions deviate from the desired length, calculated as:
@@ -278,7 +277,35 @@ We assessed each model’s ability to produce safe answers given prompts that te

  ### General Knowledge Benchmarks

- We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) provide a reproducible way to compare model performance, they were designed for evaluating pre-trained models and should not be mistaken for strong indicators of use-case-specific performance. In contrast to what [some research](https://arxiv.org/abs/2405.00332) might suggest for other models, our Pharia-1-LLM-7B models have not been tailored to such generic benchmarks and would naturally be expected to underperform on them. We will continue to evaluate transparently against these generic benchmarks as well and share the results here shortly.
+ We acknowledge that while generic accuracy-based benchmarks such as [Open LLM Leaderboard v1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) provide a reproducible way to compare model performance, they were designed for evaluating pre-trained models and should not be mistaken for strong indicators of use-case-specific performance. In contrast to what [some research](https://arxiv.org/abs/2405.00332) might suggest for other models, our Pharia-1-LLM-7B models have not been tailored to such generic benchmarks and would naturally be expected to underperform on them.
+
+ | **Benchmark** | **Shots** | **Metric** | **Pharia-1-LLM-7B-control** | **Pharia-1-LLM-7B-control-aligned** | **Llama-3.1-8B-Instruct** | **Mistral-7B-Instruct-v0.3** |
+ | --- | --- | --- | --- | --- | --- | --- |
+ | 1. **General Knowledge:** [**Open LLM Leaderboard V1**](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) | | | | | | |
+ | ARC-Challenge | 25 | **acc\_norm** | `0.546` | `0.528` | `0.563` | `0.613` |
+ | TruthfulQA | 6 | **prob\_mass** | `0.547` | `0.566` | `0.542` | `0.635` |
+ | GSM8K | 5 | **acc** | `0.014` | `0.163` | `0.573` | `0.488` |
+ | MMLU | 5 | **acc** | `0.484` | `0.525` | `0.659` | `0.624` |
+ | HellaSwag | 10 | **acc\_norm** | `0.646` | `0.761` | `0.779` | `0.826` |
+ | Winogrande | 5 | **acc** | `0.651` | `0.643` | `0.732` | `0.784` |
+ | 2. **General Knowledge: Multilingual** | | | | | | |
+ | Lambada Multilingual: en, fr, de, it, es | 10 | **acc** | `0.340` | `0.525` | `0.540` | `0.589` |
+ | ARC-Challenge-DE | 25 | **acc\_norm** | `0.486` | `0.486` | `0.459` | `0.475` |
+ | HellaSwag-DE | 10 | **acc\_norm** | `0.487` | `0.633` | `0.598` | `0.583` |
+ | MMLU-DE | 5 | **acc** | `0.428` | `0.488` | `0.589` | `0.537` |
+ | TruthfulQA-DE | 6 | **prob\_mass** | `0.561` | `0.576` | `0.509` | `0.623` |
+ | 3. **Translation** | | | | | | |
+ | WMT14 | 5 | **bleu, chrf, ter** | `32.66`, `61.32`, `53.77` | `33.07`, `61.73`, `53.14` | `35.77`, `63.08`, `50.02` | `33.29`, `61.49`, `52.56` |
+ | WMT16 | 5 | **bleu, chrf, ter** | `30.59`, `60.36`, `56.62` | `31.64`, `61.18`, `55.48` | `34.24`, `62.69`, `51.95` | `31.13`, `60.34`, `56.25` |
+ | WMT20 | 5 | **bleu, chrf, ter** | `26.60`, `58.57`, `63.09` | `26.65`, `58.82`, `63.37` | `28.12`, `59.60`, `59.73` | `26.32`, `58.06`, `61.81` |
+ | 4. **Expert Domain: Law** | | | | | | |
+ | Legal-Sentence-Classification-Dataset | 5 | **acc** | `0.315` | `0.357` | `0.424` | `0.418` |
+ | LexGlue Case-Hold | 5 | **acc\_norm** | `0.268` | `0.282` | `0.297` | `0.303` |
+ | MMLU Law | 5 | **acc** | `0.465` | `0.524` | `0.689` | `0.674` |
+ | MMLU-DE Law | 5 | **acc** | `0.439` | `0.516` | `0.626` | `0.560` |
+ | 5. **Expert Domain: Engineering** | | | | | | |
+ | MMLU Engineering | 5 | **acc** | `0.401` | `0.431` | `0.624` | `0.595` |
+ | MMLU-DE Engineering | 5 | **acc** | `0.389` | `0.426` | `0.529` | `0.533` |

  # Training Details

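For context on the accuracy-style rows in the added table (the **Shots** column and the `acc`/`acc_norm` metrics): the card does not state which evaluation framework produced these numbers, but few-shot scores of this kind are commonly computed with EleutherAI's lm-evaluation-harness. The sketch below is a minimal, hypothetical reproduction assuming that harness, the Hugging Face repo id `Aleph-Alpha/Pharia-1-LLM-7B-control` (assumed; check the Hub for the exact id), and the shot counts from the table; it is not the card's actual evaluation pipeline.

```python
# Hypothetical reproduction sketch -- the card does not name the eval framework.
# Assumes EleutherAI's lm-evaluation-harness (pip install lm-eval) and the
# Hugging Face repo id Aleph-Alpha/Pharia-1-LLM-7B-control (assumed).
from lm_eval import simple_evaluate

# Shot counts as reported in the benchmark table; harness task names are the
# closest standard equivalents (e.g. truthfulqa_mc2 for the prob_mass column).
TASK_SHOTS = {
    "arc_challenge": 25,
    "truthfulqa_mc2": 6,
    "gsm8k": 5,
    "mmlu": 5,
    "hellaswag": 10,
    "winogrande": 5,
}

for task, shots in TASK_SHOTS.items():
    out = simple_evaluate(
        model="hf",
        # trust_remote_code may be required for the Pharia architecture
        model_args="pretrained=Aleph-Alpha/Pharia-1-LLM-7B-control,"
                   "dtype=bfloat16,trust_remote_code=True",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8,
    )
    print(task, out["results"])  # per-task acc / acc_norm values
```

Exact figures also depend on harness version, prompt formatting, and normalization choices, so a rerun may not match the published numbers to the last digit.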
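Similarly, the WMT translation rows report corpus-level **bleu, chrf, ter** scores. The toy snippet below only illustrates how such scores are computed with the `sacrebleu` package; the sentences are placeholders, not WMT data, and this is not necessarily the pipeline behind the card's numbers.

```python
# Toy illustration of the three translation metrics in the table (BLEU, chrF, TER)
# using sacrebleu; the sentences below are placeholders, not WMT test sets.
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He went to the market yesterday."]
# One reference stream, parallel to the hypotheses.
references = [["The cat is sitting on the mat.", "He went to the market yesterday."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)

print(f"bleu={bleu.score:.2f}  chrf={chrf.score:.2f}  ter={ter.score:.2f}")
```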