aapot committed
Commit 137bf09
1 Parent(s): 7cf892c

Update README.md

Files changed (1):
  1. README.md +47 -47

README.md CHANGED
@@ -24,7 +24,7 @@ and first released at [this page](https://github.com/facebookresearch/llama).
 
 What does Ahma mean? Ahma is the Finnish word for wolverine! In Finnish Lapland, wolverines are the biggest cause of reindeer damage.
 
- There are two different sized base Ahma models, all pretrained from scratch for 139B tokens:
+ There are two different-sized base Ahma models, both pretrained from scratch: Ahma-3B for 139B tokens and Ahma-7B for 149B tokens:
 
 | Model | Context length | Layers | Dim | Heads | Params |
 |:--------------------------------------------------------------------------------|:---------------|:-------|:-----|:------|:-------|
@@ -203,40 +203,40 @@ This Ahma 3B base model was primarily evaluated using [FIN-bench by TurkuNLP](ht
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | TBA | TBA | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | TBA | TBA | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | TBA | TBA | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | TBA | TBA | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | TBA | TBA | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | TBA | TBA | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | TBA | TBA | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | TBA | TBA | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | TBA | TBA | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | TBA | TBA | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | TBA | TBA | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | TBA | TBA | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | TBA | TBA | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | TBA | TBA | **38.93** | **36.50** | **40.00** |
+ | Analogies | 50.77 | 48.46 | 56.92 | TBA | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | TBA | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | TBA | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | TBA | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | TBA | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | TBA | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | TBA | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | TBA | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | TBA | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | TBA | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | TBA | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | TBA | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | TBA | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | TBA | **38.93** | **36.50** | **40.00** |
 
 
 3-shot results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | TBA | TBA | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | TBA | TBA | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | TBA | TBA | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | TBA | TBA | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | TBA | TBA | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | TBA | TBA | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | TBA | TBA | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | TBA | TBA | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | TBA | TBA | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | TBA | TBA | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | TBA | TBA | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | TBA | TBA | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | TBA | TBA | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | TBA | TBA | **46.99** | **48.07** | **57.36** |
+ | Analogies | 50.77 | 49.23 | 49.23 | TBA | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | TBA | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | TBA | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | TBA | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | TBA | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | TBA | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | TBA | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | TBA | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | TBA | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | TBA | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | TBA | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | TBA | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | TBA | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | TBA | **46.99** | **48.07** | **57.36** |
 
 
 As we can see, the Ahma 3B base model outperforms 2X-larger models such as FinGPT 8B and Viking 7B, especially on non-arithmetic tasks in 0-shot usage. Even the 10X-larger Poro 34B model, which is generally better, does not show a huge performance difference considering its size, and Ahma 3B actually surpasses it on some tasks. This result might be attributed to Ahma's 2-stage pretraining and the inclusion of instruction-following examples during the pretraining phase.
@@ -252,29 +252,29 @@ Single-turn results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
 |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
- | Coding | 1.00 | 1.00 | TBA | TBA |
- | Extraction | 2.00 | 1.30 | TBA | TBA |
- | Humanities | 4.05 | 6.20 | TBA | TBA |
- | Math | 3.00 | 3.20 | TBA | TBA |
- | Reasoning | 2.90 | 4.60 | TBA | TBA |
- | Roleplay | 4.80 | 6.50 | TBA | TBA |
- | STEM | 5.10 | 5.95 | TBA | TBA |
- | Writing | 6.60 | 9.00 | TBA | TBA |
- | **Overall Average** | **3.68** | **4.72** | TBA | TBA |
+ | Coding | 1.00 | 1.00 | 1.70 | TBA |
+ | Extraction | 2.00 | 1.30 | 3.10 | TBA |
+ | Humanities | 4.05 | 6.20 | 6.60 | TBA |
+ | Math | 3.00 | 3.20 | 3.90 | TBA |
+ | Reasoning | 2.90 | 4.60 | 3.70 | TBA |
+ | Roleplay | 4.80 | 6.50 | 6.60 | TBA |
+ | STEM | 5.10 | 5.95 | 6.75 | TBA |
+ | Writing | 6.60 | 9.00 | 7.10 | TBA |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | TBA |
 
 Multi-turn results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
 |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
- | Coding | 1.00 | 1.00 | TBA | TBA | 3.70 |
- | Extraction | 1.55 | 1.15 | TBA | TBA | 6.37 |
- | Humanities | 3.25 | 6.20 | TBA | TBA | 9.25 |
- | Math | 2.20 | 2.70 | TBA | TBA | 1.20 |
- | Reasoning | 2.45 | 3.50 | TBA | TBA | 4.35 |
- | Roleplay | 4.90 | 6.40 | TBA | TBA | 7.35 |
- | STEM | 4.20 | 4.78 | TBA | TBA | 7.80 |
- | Writing | 3.80 | 6.65 | TBA | TBA | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | TBA | TBA | **6.06** |
+ | Coding | 1.00 | 1.00 | 1.40 | TBA | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | TBA | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | TBA | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | TBA | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | TBA | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | TBA | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | TBA | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | TBA | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | TBA | **6.06** |
 
 As we can see, the Ahma 3B base model struggles with multi-turn examples, as expected, since it has only been pretrained on single-turn instruction-following examples. Coding performance was also expectedly poor because the Ahma 3B model was not trained on code data. In some evaluation examples, Ahma 3B additionally tended to repeat the generated text over and over, which affected the scoring; adding a repetition penalty to the evaluation script's generation method already improved the scores significantly, so in real-world use the Ahma 3B model should be run with better generation settings than those used in this benchmark (see the sketch below).
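As an illustration of such generation settings, here is a minimal sketch using the Hugging Face `transformers` API. It is not the evaluation script referenced above, and the checkpoint name `Finnish-NLP/Ahma-3B` plus all generation values are assumptions to adjust for your setup:

```python
# Minimal sketch (not the authors' evaluation script): generate text from an
# Ahma checkpoint with a repetition penalty to curb the looping/repetition
# behavior described above. Checkpoint name and settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Finnish-NLP/Ahma-3B"  # assumed repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Mikä on ahma?"  # "What is a wolverine?" in Finnish
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.2,  # discourages verbatim repetition of generated text
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A `repetition_penalty` around 1.1–1.3 is a common starting point; the text above does not specify the exact value used in the benchmark runs.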
 
 