Added Mistral results

README.md (changed)

@@ -84,38 +84,7 @@ The user should perform evaluation for their particular model application scenario.
The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.

Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
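For illustration, a held-out perplexity like the NCC validation figure above is simply the exponentiated average next-token loss. The following minimal sketch computes it with Hugging Face Transformers; the checkpoint id, dtype, and the single example string are assumptions, and a corpus-level figure would average the loss over all validation tokens rather than one document.

```python
# Minimal sketch: next-token perplexity of a causal LM on held-out text.
# The checkpoint id is an assumption; any of the compared models could be plugged in.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def perplexity(text: str, max_length: int = 2048) -> float:
    """Cross-entropy of each token given its prefix, averaged and exponentiated."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("Nasjonalbiblioteket har samlet inn store mengder norsk tekst."))
```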
-We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b).
+We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
-
-
-### Reading comprehension
-
-[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
-
-<details>
-<summary>Method</summary>
-
-* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
-* Few-shot results show the average scores across 5 repetitions
-* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
-* Performance metrics: macro-averaged F1-score and exact match (EM).
-
-</details>
-
-<details open>
-<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
-
-|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
-|---|---|---|---|
-|NorMistral-7b-warm|**48.6**/**24.8**|**63.6**/**40.0**|**66.5**/43.8|
-|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
-|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
-|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
-|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
-|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
-|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/**44.5**|
-
-</details>


### Sentiment analysis
@@ -127,7 +96,7 @@ We use the binary formulation of this task (positive vs. negative).
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot perplexity-based evaluation.
-* Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ".
+* Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ". Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
* Few-shot results show the average scores across 5 repetitions
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/sentiment_analysis.py
* Performance metric: macro-averaged F1-score.
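The perplexity-based setting above amounts to scoring each candidate label as a continuation of the prompt and picking the more likely one. Below is an illustrative re-implementation of that idea, not the linked sentiment_analysis.py; the checkpoint id and the assumption that the prompt tokenization is a prefix of the full tokenization are simplifications.

```python
# Illustrative sketch of perplexity-based (label-scoring) sentiment classification
# with the prompt "Tekst: {text}\nSentiment:{label}". Not the official script.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def label_logprob(text: str, label: str) -> float:
    """Sum of log-probabilities of the label tokens, conditioned on the prompt."""
    prompt = f"Tekst: {text}\nSentiment:"
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(prompt + label, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(dim=-1)
    # Token at position t is predicted from position t-1; score only the label tokens.
    # (Simplification: assumes the prompt tokenization is a prefix of the full one.)
    return sum(logprobs[0, t - 1, full[0, t]].item() for t in range(n_prompt, full.shape[1]))

def classify(text: str) -> str:
    return max(("positiv", "negativ"), key=lambda label: label_logprob(text, label))

print(classify("Denne filmen var en stor skuffelse."))
```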
@@ -143,12 +112,48 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.3|62.2|80.1|
|NorBLOOM-7b|**75.7**|73.8|65.5|
|NB-GPT-J|48.4|56.5|65.2|
-|Falcon-7B|53.3|61.6|74.9|
|GPT-Sw3-6.7B|61.5|72.2|76.5|
|GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
+|Falcon-7B|53.3|61.6|74.9|
+|Mistral-7B-v0.1|70.2|72.9|84.8|

</details>

+
+
+### Reading comprehension
+
+[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
+
+<details>
+<summary>Method</summary>
+
+* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
+* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
+* Few-shot results show the average scores across 5 repetitions
+* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
+* Performance metrics: macro-averaged F1-score and exact match (EM).
+
+</details>
+
+<details open>
+<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
+
+|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
+|---|---|---|---|
+|NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
+|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
+|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
+|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
+|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
+|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
+|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
+|Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|
+
+</details>
+
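The reading-comprehension setting added above evaluates by plain generation: format the NorQuAD prompt, decode greedily, and compare the first generated line against the gold answer. The sketch below illustrates that loop; it is not the linked norquad.py, and the checkpoint id, the 64-token cap, and the first-line truncation are assumptions.

```python
# Illustrative sketch of the zero-/few-shot NorQuAD setting: greedy decoding
# from the prompt given in the Method bullets. Not the official norquad.py.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

TEMPLATE = "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:"

def predict_answer(title, text, question, demonstrations=()):
    # Few-shot demonstrations are fully formatted prompt+answer blocks prepended to the query.
    prompt = "\n\n".join([*demonstrations, TEMPLATE.format(title=title, text=text, question=question)])
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,               # assumed cap on answer length
        do_sample=False,                 # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(continuation, skip_special_tokens=True).strip()
    # Keep only the first line of the continuation as the predicted answer span.
    return answer.splitlines()[0] if answer else ""
```

The decoded answers would then be scored with the macro-averaged F1 and exact-match metrics listed in the Method bullets.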
### Machine translation

[Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
@@ -157,7 +162,7 @@ We use the binary formulation of this task (positive vs. negative).
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```.
+* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
* Few-shot results show the average scores across 5 repetitions
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
* Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
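Generation for the translation task follows the same greedy pattern as the NorQuAD sketch above, only with the ```"{source_language}: ...\n{target_language}:"``` prompt; scoring uses BLEU and chrF++. One way to compute both with the sacrebleu package is sketched below; the hypothesis and reference sentences are invented for illustration.

```python
# Illustrative scoring of translations with BLEU and chrF++ using sacrebleu
# (word_order=2 turns chrF into chrF++). The example sentences are made up.
import sacrebleu

hypotheses = ["Dette er en katt.", "Jeg liker fisk."]        # model outputs
references = [["Dette er en katt.", "Jeg er glad i fisk."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++
print(f"BLEU = {bleu.score:.1f}, chrF++ = {chrf.score:.1f}")
```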
@@ -173,9 +178,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
|NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
|NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
-|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
|GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
|GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
+|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
+|Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
+

</details>

@@ -188,9 +195,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
|NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
-|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
|GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
|GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
+|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
+|Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
+

</details>

@@ -204,9 +213,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|9.8/41.4|24.8/58.3|47.6/67.7|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
+|Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
+

</details>

@@ -219,9 +230,10 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
+|Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|

</details>

@@ -235,9 +247,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
|NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
-|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
|GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
|GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
+|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
+|Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
+

</details>

@@ -250,9 +264,10 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
|NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
|NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
-|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
|GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
|GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
+|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
+|Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|

</details>
