Added Mistral results

README.md (changed)

@@ -84,38 +84,7 @@ The user should perform evaluation for their particular model application scenario.
The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.

Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
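For illustration, a held-out perplexity like the NCC validation figure above is simply the exponentiated average next-token loss. The following minimal sketch computes it with Hugging Face Transformers; the checkpoint id, dtype, and the single example string are assumptions, and a corpus-level figure would average the loss over all validation tokens rather than one document.

```python
# Minimal sketch: next-token perplexity of a causal LM on held-out text.
# The checkpoint id is an assumption; any of the compared models could be plugged in.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def perplexity(text: str, max_length: int = 2048) -> float:
    """Cross-entropy of each token given its prefix, averaged and exponentiated."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("Nasjonalbiblioteket har samlet inn store mengder norsk tekst."))
```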
-We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b).
+We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
-
-
-### Reading comprehension
-
-[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
-
-<details>
-<summary>Method</summary>
-
-* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
-* Few-shot results show the average scores across 5 repetitions
-* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
-* Performance metrics: macro-averaged F1-score and exact match (EM).
-
-</details>
-
-<details open>
-<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
-
-|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
-|---|---|---|---|
-|NorMistral-7b-warm|**48.6**/**24.8**|**63.6**/**40.0**|**66.5**/43.8|
-|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
-|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
-|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
-|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
-|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
-|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/**44.5**|
-
-</details>


### Sentiment analysis
@@ -127,7 +96,7 @@ We use the binary formulation of this task (positive vs. negative).
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot perplexity-based evaluation.
-* Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ".
+* Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ". Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
* Few-shot results show the average scores across 5 repetitions
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/sentiment_analysis.py
* Performance metric: macro-averaged F1-score.
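The perplexity-based setting above amounts to scoring each candidate label as a continuation of the prompt and picking the more likely one. Below is an illustrative re-implementation of that idea, not the linked sentiment_analysis.py; the checkpoint id and the assumption that the prompt tokenization is a prefix of the full tokenization are simplifications.

```python
# Illustrative sketch of perplexity-based (label-scoring) sentiment classification
# with the prompt "Tekst: {text}\nSentiment:{label}". Not the official script.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def label_logprob(text: str, label: str) -> float:
    """Sum of log-probabilities of the label tokens, conditioned on the prompt."""
    prompt = f"Tekst: {text}\nSentiment:"
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(prompt + label, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(dim=-1)
    # Token at position t is predicted from position t-1; score only the label tokens.
    # (Simplification: assumes the prompt tokenization is a prefix of the full one.)
    return sum(logprobs[0, t - 1, full[0, t]].item() for t in range(n_prompt, full.shape[1]))

def classify(text: str) -> str:
    return max(("positiv", "negativ"), key=lambda label: label_logprob(text, label))

print(classify("Denne filmen var en stor skuffelse."))
```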
@@ -143,12 +112,48 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.3|62.2|80.1|
|NorBLOOM-7b|**75.7**|73.8|65.5|
|NB-GPT-J|48.4|56.5|65.2|
-|Falcon-7B|53.3|61.6|74.9|
|GPT-Sw3-6.7B|61.5|72.2|76.5|
|GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
+|Falcon-7B|53.3|61.6|74.9|
+|Mistral-7B-v0.1|70.2|72.9|84.8|

</details>

+
+
+### Reading comprehension
+
+[NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
+
+<details>
+<summary>Method</summary>
+
+* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
+* Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"```
+* Few-shot results show the average scores across 5 repetitions
+* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
+* Performance metrics: macro-averaged F1-score and exact match (EM).
+
+</details>
+
+<details open>
+<summary>Performance results on the extractive question answering task (NorQuAD)</summary>
+
+|Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
+|---|---|---|---|
+|NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
+|NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
+|NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
+|NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
+|GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
+|GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
+|Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
+|Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|
+
+</details>
+
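The reading-comprehension setting added above evaluates by plain generation: format the NorQuAD prompt, decode greedily, and compare the first generated line against the gold answer. The sketch below illustrates that loop; it is not the linked norquad.py, and the checkpoint id, the 64-token cap, and the first-line truncation are assumptions.

```python
# Illustrative sketch of the zero-/few-shot NorQuAD setting: greedy decoding
# from the prompt given in the Method bullets. Not the official norquad.py.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "norallm/normistral-7b-warm"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

TEMPLATE = "Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:"

def predict_answer(title, text, question, demonstrations=()):
    # Few-shot demonstrations are fully formatted prompt+answer blocks prepended to the query.
    prompt = "\n\n".join([*demonstrations, TEMPLATE.format(title=title, text=text, question=question)])
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,               # assumed cap on answer length
        do_sample=False,                 # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )
    continuation = outputs[0, inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(continuation, skip_special_tokens=True).strip()
    # Keep only the first line of the continuation as the predicted answer span.
    return answer.splitlines()[0] if answer else ""
```

The decoded answers would then be scored with the macro-averaged F1 and exact-match metrics listed in the Method bullets.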
### Machine translation

[Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
@@ -157,7 +162,7 @@ We use the binary formulation of this task (positive vs. negative).
<summary>Method</summary>

* Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
-* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```.
+* Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
* Few-shot results show the average scores across 5 repetitions
* Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
* Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
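Generation for the translation task follows the same greedy pattern as the NorQuAD sketch above, only with the ```"{source_language}: ...\n{target_language}:"``` prompt; scoring uses BLEU and chrF++. One way to compute both with the sacrebleu package is sketched below; the hypothesis and reference sentences are invented for illustration.

```python
# Illustrative scoring of translations with BLEU and chrF++ using sacrebleu
# (word_order=2 turns chrF into chrF++). The example sentences are made up.
import sacrebleu

hypotheses = ["Dette er en katt.", "Jeg liker fisk."]        # model outputs
references = [["Dette er en katt.", "Jeg er glad i fisk."]]  # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++
print(f"BLEU = {bleu.score:.1f}, chrF++ = {chrf.score:.1f}")
```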
@@ -173,9 +178,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
|NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
|NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
-|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
|GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
|GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
+|Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
+|Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
+

</details>

@@ -188,9 +195,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
|NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
-|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
|GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
|GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
+|Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
+|Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
+

</details>

@@ -204,9 +213,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|9.8/41.4|24.8/58.3|47.6/67.7|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/51.7|
+|Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
+

</details>

@@ -219,9 +230,10 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
|NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
|NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
-|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
|GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
|GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
+|Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
+|Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|

</details>

@@ -235,9 +247,11 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
|NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
|NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
-|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
|GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
|GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
+|Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
+|Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
+

</details>

@@ -250,9 +264,10 @@ We use the binary formulation of this task (positive vs. negative).
|NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
|NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
|NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
-|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
|GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
|GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
+|Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
+|Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|

</details>
