Yeb Havinga
committed on
Commit
·
defd527
1
Parent(s):
f333877
Autoupdate README.md
Browse files
- README.md +127 -87
- evaluation_t5_dutch_english.png +0 -0
README.md
CHANGED
@@ -1,6 +1,7 @@
|
|
1 |
---
|
2 |
language:
|
3 |
- nl
|
|
|
4 |
datasets:
|
5 |
- yhavinga/mc4_nl_cleaned
|
6 |
tags:
|
@@ -14,17 +15,17 @@ license: apache-2.0
|
|
14 |
# t5-small-24L-dutch-english
|
15 |
|
16 |
A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) sequence to sequence model
|
17 |
-
pre-trained from scratch on [cleaned Dutch 🇳🇱🇧🇪
|
|
|
|
|
18 |
|
19 |
|
20 |
This **t5 eff** model has **249M** parameters.
|
21 |
-
It was pre-trained on the dataset
|
22 |
`mc4_nl_cleaned` config `large_en_nl` for **1** epoch(s) and a duration of **4d10h**,
|
23 |
-
with a sequence length of **512**, batch size **128** and **851852** total steps.
|
24 |
Pre-training evaluation loss and accuracy are **1.18** and **0.74**.
|
25 |
-
|
26 |
-
(note: this evaluation model was not saved).
|
27 |
-
|
28 |
* Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks; the inference widget on the right has therefore been turned off.
|
29 |
* For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
|
30 |
the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
|
@@ -35,9 +36,6 @@ and configs, though it must be noted that this model (t5-small-24L-dutch-english
|
|
35 |
* **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)** by *Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler*.
|
36 |
|
37 |
|
38 |
-
![model image](https://camo.githubusercontent.com/623b4dea0b653f2ad3f36c71ebfe749a677ac0a1/68747470733a2f2f6d69726f2e6d656469756d2e636f6d2f6d61782f343030362f312a44304a31674e51663876727255704b657944387750412e706e67)
|
39 |
-
|
40 |
-
|
41 |
## Tokenizer
|
42 |
|
43 |
The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
|
@@ -45,9 +43,9 @@ and has 32003 tokens.
|
|
45 |
It was trained on Dutch and English with scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
|
46 |
See [./raw/main/tokenizer.json](tokenizer.json) for details.
|
47 |
|
48 |
-
## Dataset
|
49 |
|
50 |
-
All models listed below are trained on
|
51 |
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
|
52 |
which is the original mC4, except
|
53 |
|
@@ -58,96 +56,138 @@ which is the original mC4, except
|
|
58 |
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
|
59 |
"use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
|
60 |
|
61 |
-
The Dutch and English models are trained on a 50/50% mix of Dutch mC4 and English C4.
|
|
|
|
|
62 |
|
63 |
-
## Models
|
64 |
|
65 |
-
Three types of models have been trained.
|
|
|
66 |
The other model types t5-v1.1 and t5-eff have `gated-gelu` instead of `relu` as activation function,
|
67 |
and are trained with a dropout of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`).
|
68 |
-
The T5-eff models are models
|
69 |
-
the several dimensions of these models.
|
70 |
-
|
71 |
|
72 |
-
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|
73 |
|:------------------|:----------------|:-----------------------------|:---------------------------|:----------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:-----------------------------------|:--------------------------------------|
|
74 |
-
| type
|
75 |
-
| d_model
|
76 |
-
| d_ff
|
77 |
-
| num_heads
|
78 |
-
| d_kv
|
79 |
-
| num_layers
|
80 |
-
| num parameters
|
81 |
-
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
|
82 |
-
| dropout
|
83 |
-
| dataset
|
84 |
-
| tr. seq len
|
85 |
-
| batch size
|
86 |
-
| total steps
|
87 |
-
| epochs
|
88 |
-
| duration
|
89 |
-
| optimizer
|
90 |
-
| lr
|
91 |
-
| warmup
|
92 |
-
| eval loss
|
93 |
-
| eval acc
|
94 |
-
|
95 |
-
## Evaluation
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
|
102 |
-
|
103 |
-
|
104 |
-
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
|
|
109 |
-
|
110 |
-
|
111 |
|
112 |
## Translation models
|
113 |
|
114 |
-
The small
|
115 |
-
|
116 |
-
|
117 |
-
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
|
122 |
-
|
123 |
-
|
|
124 |
-
|
125 |
-
|
|
126 |
-
|
|
127 |
-
|
|
128 |
-
|
|
129 |
-
| tatoeba_bp
|
130 |
-
|
|
131 |
-
|
|
132 |
-
|
|
133 |
-
|
|
134 |
-
|
|
135 |
-
|
|
136 |
-
|
|
137 |
-
|
|
138 |
-
|
|
139 |
-
|
|
140 |
-
|
|
|
|
|
|
|
|
141 |
|
142 |
## Acknowledgements
|
143 |
|
144 |
This project would not have been possible without compute generously provided by Google through the
|
145 |
-
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem
|
146 |
-
|
147 |
-
|
148 |
-
have completed this project otherwise.
|
149 |
The following repositories were helpful in setting up the TPU-VM,
|
150 |
-
and getting an idea what sensible hyper-parameters are for training gpt2 from scratch
|
151 |
|
152 |
* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
|
153 |
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
|
|
|
1 |
---
|
2 |
language:
|
3 |
- nl
|
4 |
+
- en
|
5 |
datasets:
|
6 |
- yhavinga/mc4_nl_cleaned
|
7 |
tags:
|
|
|
15 |
# t5-small-24L-dutch-english
|
16 |
|
17 |
A [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) sequence to sequence model
|
18 |
+
pre-trained from scratch on [cleaned Dutch 🇳🇱🇧🇪 mC4 and cleaned English 🇬🇧 C4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned).
|
19 |
+
|
20 |
+
|
21 |
|
22 |
|
23 |
This **t5 eff** model has **249M** parameters.
|
24 |
+
It was pre-trained with the masked language modeling (denoising token span corruption) objective on the dataset
|
25 |
`mc4_nl_cleaned` config `large_en_nl` for **1** epoch(s) and a duration of **4d10h**,
|
26 |
+
with a sequence length of **512**, batch size **128** and **851852** total steps (**56B** tokens).
|
27 |
Pre-training evaluation loss and accuracy are **1.18** and **0.74**.
|
28 |
+
Refer to the evaluation section below for a comparison of the pre-trained models on summarization and translation.
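
The **56B** token figure above is consistent with the reported step count, batch size and sequence length. A minimal arithmetic sketch, assuming every step processes a full batch of full-length sequences (the variable names are illustrative, not taken from the training scripts):

```python
# Values copied from the model description above.
total_steps = 851_852
batch_size = 128
sequence_length = 512

# Assumption: each step sees batch_size sequences of exactly sequence_length tokens.
tokens_seen = total_steps * batch_size * sequence_length

print(f"{tokens_seen / 1e9:.1f}B tokens")
```

Because padding and shorter final batches are ignored, this is an upper bound on the number of real tokens seen.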
|
|
|
|
|
29 |
* Pre-trained T5 models need to be fine-tuned before they can be used for downstream tasks; the inference widget on the right has therefore been turned off.
|
30 |
* For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for
|
31 |
the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
|
|
|
36 |
* **[Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers](https://arxiv.org/abs/2109.10686)** by *Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler*.
|
37 |
|
38 |
|
|
|
|
|
|
|
39 |
## Tokenizer
|
40 |
|
41 |
The model uses a cased SentencePiece tokenizer configured with the `Nmt, NFKC, Replace multi-space to single-space` normalizers
|
|
|
43 |
It was trained on Dutch and English with scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
|
44 |
See [./raw/main/tokenizer.json](tokenizer.json) for details.
|
45 |
|
46 |
+
## Dataset(s)
|
47 |
|
48 |
+
All models listed below are pre-trained on
|
49 |
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
|
50 |
which is the original mC4, except
|
51 |
|
|
|
56 |
* Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
|
57 |
"use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
|
58 |
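
The phrase-based removal rule above can be sketched as a simple predicate. This is an illustration only: the function name `keep_document` is hypothetical, case-insensitive matching is an assumption, and the other cleaning rules of the dataset are not reproduced here.

```python
# Phrases copied verbatim from the cleaning rule above.
BLOCKED_PHRASES = (
    "javascript", "lorum ipsum", "terms of use", "privacy policy",
    "cookie policy", "uses cookies", "use of cookies", "use cookies",
    "elementen ontbreken", "deze printversie",
)


def keep_document(text: str) -> bool:
    """Return False when the document contains any blocked phrase."""
    lowered = text.lower()  # assumption: matching ignores case
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


docs = ["Dit is een gewone zin.", "Deze site uses cookies."]
kept = [doc for doc in docs if keep_document(doc)]
```

Here `kept` retains only the first document, because the second one contains "uses cookies".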
|
59 |
+
The Dutch and English models are pre-trained on a 50/50 mix of Dutch mC4 and English C4.
|
60 |
+
|
61 |
+
The translation models are fine-tuned on [CCMatrix](https://huggingface.co/datasets/yhavinga/ccmatrix).
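
One way to picture the equal mix of Dutch and English pre-training documents mentioned above is strict alternation between two document streams. The sketch below is an assumption about how such a mix could be produced, not a description of how the `large_en_nl` config is actually built; the function name is hypothetical.

```python
from typing import Iterable, Iterator


def interleave_equal(dutch_docs: Iterable[str], english_docs: Iterable[str]) -> Iterator[str]:
    """Alternate Dutch and English documents; stops when either stream runs out."""
    for dutch, english in zip(dutch_docs, english_docs):
        yield dutch
        yield english


mixed = list(interleave_equal(["nl-1", "nl-2"], ["en-1", "en-2"]))
```

Stopping at the shorter stream keeps the ratio exact, at the cost of dropping surplus documents from the longer one.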
|
62 |
|
63 |
+
## Dutch T5 Models
|
64 |
|
65 |
+
Three types of [Dutch T5 models have been trained (blog)](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models).
|
66 |
+
`t5-base-dutch` is the only model with an original T5 config.
|
67 |
The other model types t5-v1.1 and t5-eff have `gated-gelu` instead of `relu` as activation function,
|
68 |
and are trained with a dropout of `0.0` unless training would diverge (`t5-v1.1-large-dutch-cased`).
|
69 |
+
The T5-eff models differ in their number of layers. The table below lists
|
70 |
+
the dimensions of these models. Not all t5-eff models are efficient; the clearest example is the inefficient
|
71 |
+
`t5-xl-4L-dutch-english-cased`.
|
72 |
|
73 |
+
| | [t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | [t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | [t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | [t5-v1_1-base-dutch-english-cased](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased) | [t5-v1_1-base-dutch-english-cased-1024](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased-1024) | [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english) | [t5-xl-4L-dutch-english-cased](https://huggingface.co/yhavinga/t5-xl-4L-dutch-english-cased) | [t5-base-36L-dutch-english-cased](https://huggingface.co/yhavinga/t5-base-36L-dutch-english-cased) | [t5-eff-xl-8l-dutch-english-cased](https://huggingface.co/yhavinga/t5-eff-xl-8l-dutch-english-cased) | [t5-eff-large-8l-dutch-english-cased](https://huggingface.co/yhavinga/t5-eff-large-8l-dutch-english-cased) |
|
74 |
|:------------------|:----------------|:-----------------------------|:---------------------------|:----------------------------|:-----------------------------------|:----------------------------------------|:-----------------------------|:-------------------------------|:----------------------------------|:-----------------------------------|:--------------------------------------|
|
75 |
+
| *type* | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
|
76 |
+
| *d_model* | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
|
77 |
+
| *d_ff* | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
|
78 |
+
| *num_heads* | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
|
79 |
+
| *d_kv* | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
|
80 |
+
| *num_layers* | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
|
81 |
+
| *num parameters* | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
|
82 |
+
| *feed_forward_proj* | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
|
83 |
+
| *dropout* | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
|
84 |
+
| *dataset* | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
|
85 |
+
| *tr. seq len* | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
|
86 |
+
| *batch size* | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
|
87 |
+
| *total steps* | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
|
88 |
+
| *epochs* | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
|
89 |
+
| *duration* | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d 19h | 3d 23h |
|
90 |
+
| *optimizer* | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
|
91 |
+
| *lr* | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
|
92 |
+
| *warmup* | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 1500.0 | 1500.0 |
|
93 |
+
| *eval loss* | 1.38 | 1.20 | 0.96 | 1.07 | 1.11 | 1.13 | 1.18 | 1.27 | 1.05 | 1.3019 | 1.15 |
|
94 |
+
| *eval acc* | 0.70 | 0.73 | 0.78 | 0.76 | 0.75 | 0.74 | 0.74 | 0.72 | 0.76 | 0.71 | 0.74 |
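
The *num parameters* row can be roughly reproduced from the dimensions in the table. The sketch below is an estimate under stated assumptions: the decoder has the same number of layers as the encoder, input and output embeddings are not shared for the gated-gelu models, and layer norms and relative position biases are ignored. The function name is hypothetical, and the vocabulary size of 32003 is the tokenizer size stated in this README.

```python
def estimate_t5_parameters(vocab_size, d_model, d_ff, num_heads, d_kv,
                           num_layers, gated=True, tied_embeddings=False):
    """Rough T5 parameter count; ignores layer norms and relative position biases."""
    inner_dim = num_heads * d_kv
    attention = 4 * d_model * inner_dim           # q, k, v and output projections
    ffn_matrices = 3 if gated else 2              # gated activations use two input projections
    feed_forward = ffn_matrices * d_model * d_ff
    encoder = num_layers * (attention + feed_forward)
    decoder = num_layers * (2 * attention + feed_forward)  # self- plus cross-attention
    embedding_tables = 1 if tied_embeddings else 2
    embeddings = embedding_tables * vocab_size * d_model
    return encoder + decoder + embeddings


# Dimensions from the t5-small-24L-dutch-english column of the table.
small_24l = estimate_t5_parameters(
    vocab_size=32003, d_model=512, d_ff=1920, num_heads=8, d_kv=64, num_layers=24)
print(f"{small_24l / 1e6:.1f}M")
```

For this column the estimate lands just under 250M, in line with the table. The same function is not guaranteed to match the other columns, since their vocabulary sizes are not stated here.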
|
95 |
+
|
96 |
+
## Evaluation
|
97 |
+
|
98 |
+
Most models from the list above have been fine-tuned for summarization and translation.
|
99 |
+
The figure below shows the evaluation scores, where the x-axis shows the translation Bleu score (higher is better)
|
100 |
+
and the y-axis shows the summarization Rouge1 score (higher is better).
|
101 |
+
Point size is proportional to the model size. Models with faster inference speed are plotted in green; models with slower inference speed are
|
102 |
+
plotted in blue.
|
103 |
+
|
104 |
+
![Evaluation T5 Dutch English](evaluation_t5_dutch_english.png)
|
105 |
+
|
106 |
+
Evaluation was run on fine-tuned models trained with the following settings:
|
107 |
+
|
108 |
+
|
109 |
+
| | Summarization | Translation |
|
110 |
+
|---------------:|------------------|-------------------|
|
111 |
+
| Dataset | CNN Dailymail NL | CCMatrix en -> nl |
|
112 |
+
| #train samples | 50K | 50K |
|
113 |
+
| Optimizer | Adam | Adam |
|
114 |
+
| learning rate | 0.001 | 0.0005 |
|
115 |
+
| source length | 1024 | 128 |
|
116 |
+
| target length | 142 | 128 |
|
117 |
+
| label smoothing | 0.05 | 0.1 |
|
118 |
+
| #eval samples | 1000 | 1000 |
|
119 |
+
|
120 |
+
Note that the amount of training data is limited to a fraction of the total dataset sizes, so the scores
|
121 |
+
below can only be used to compare the 'transfer-learning' strength. The fine-tuned checkpoints for this evaluation
|
122 |
+
are not saved, since they were trained for comparison of pre-trained models only.
|
123 |
+
|
124 |
+
The numbers for summarization are the Rouge scores on 1000 documents from the test split.
|
125 |
+
|
126 |
+
| | [t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | [t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | [t5-v1_1-base-dutch-english-cased](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased) | [t5-v1_1-base-dutch-english-cased-1024](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased-1024) | [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english) | [t5-xl-4L-dutch-english-cased](https://huggingface.co/yhavinga/t5-xl-4L-dutch-english-cased) | [t5-base-36L-dutch-english-cased](https://huggingface.co/yhavinga/t5-base-36L-dutch-english-cased) | [t5-eff-large-8l-dutch-english-cased](https://huggingface.co/yhavinga/t5-eff-large-8l-dutch-english-cased) | mt5-base |
|
127 |
+
|:------------------------|----------------:|-----------------------------:|---------------------------:|-----------------------------------:|----------------------------------------:|-----------------------------:|-------------------------------:|----------------------------------:|--------------------------------------:|-----------:|
|
128 |
+
| *rouge1* | 33.38 | 33.97 | 34.39 | 33.38 | 34.97 | 34.38 | 30.35 | **35.04** | 34.04 | 33.25 |
|
129 |
+
| *rouge2* | 13.32 | 13.85 | 13.98 | 13.47 | 14.01 | 13.89 | 11.57 | **14.23** | 13.76 | 12.74 |
|
130 |
+
| *rougeL* | 24.22 | 24.72 | 25.1 | 24.34 | 24.99 | **25.25** | 22.69 | 25.05 | 24.75 | 23.5 |
|
131 |
+
| *rougeLsum* | 30.23 | 30.9 | 31.44 | 30.51 | 32.01 | 31.38 | 27.5 | **32.12** | 31.12 | 30.15 |
|
132 |
+
| *samples_per_second* | 3.18 | 3.02 | 2.99 | 3.22 | 2.97 | 1.57 | 2.8 | 0.61 | **3.27** | 1.22 |
|
133 |
+
|
134 |
+
The models below have been evaluated for English to Dutch translation.
|
135 |
+
Note that the first four models are pre-trained on Dutch only. That they still perform adequately is probably because
|
136 |
+
the translation direction is English to Dutch.
|
137 |
+
The numbers reported are the Bleu scores on 1000 documents from the test split.
|
138 |
+
|
139 |
+
| | [t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | [t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | [t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | [t5-v1_1-base-dutch-english-cased](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased) | [t5-v1_1-base-dutch-english-cased-1024](https://huggingface.co/yhavinga/t5-v1_1-base-dutch-english-cased-1024) | [t5-small-24L-dutch-english](https://huggingface.co/yhavinga/t5-small-24L-dutch-english) | [t5-xl-4L-dutch-english-cased](https://huggingface.co/yhavinga/t5-xl-4L-dutch-english-cased) | [t5-base-36L-dutch-english-cased](https://huggingface.co/yhavinga/t5-base-36L-dutch-english-cased) | [t5-eff-large-8l-dutch-english-cased](https://huggingface.co/yhavinga/t5-eff-large-8l-dutch-english-cased) | mt5-base |
|
140 |
+
|:-------------------------------|----------------:|-----------------------------:|---------------------------:|----------------------------:|-----------------------------------:|----------------------------------------:|-----------------------------:|-------------------------------:|----------------------------------:|--------------------------------------:|-----------:|
|
141 |
+
| *precision_ng1* | 74.17 | 78.09 | 77.08 | 72.12 | 77.19 | 78.76 | 78.59 | 77.3 | **79.75** | 78.88 | 73.47 |
|
142 |
+
| *precision_ng2* | 52.42 | 57.52 | 55.31 | 48.7 | 55.39 | 58.01 | 57.83 | 55.27 | **59.89** | 58.27 | 50.12 |
|
143 |
+
| *precision_ng3* | 39.55 | 45.2 | 42.54 | 35.54 | 42.25 | 45.13 | 45.02 | 42.06 | **47.4** | 45.95 | 36.59 |
|
144 |
+
| *precision_ng4* | 30.23 | 36.04 | 33.26 | 26.27 | 32.74 | 35.72 | 35.41 | 32.61 | **38.1** | 36.91 | 27.26 |
|
145 |
+
| *bp* | 0.99 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 |
|
146 |
+
| *score* | 45.88 | 51.21 | 48.31 | 41.59 | 48.17 | 51.31 | 50.82 | 47.83 | **53** | 51.79 | 42.74 |
|
147 |
+
| *samples_per_second* | **45.19** | 45.05 | 38.67 | 10.12 | 42.19 | 42.61 | 12.85 | 33.74 | 9.07 | 37.86 | 9.03 |
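
The *score* row in the table above follows from the other rows: BLEU is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. A minimal sketch of that relationship, using the `t5-small-24L-dutch-english` column (the function name is hypothetical, and the inputs are the rounded values from the table):

```python
import math


def bleu_from_components(precisions, brevity_penalty):
    """Combine n-gram precisions (in percent) and a brevity penalty into a BLEU score."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)


# precision_ng1..4 and bp for t5-small-24L-dutch-english, as listed in the table.
score = bleu_from_components([78.59, 57.83, 45.02, 35.41], 0.98)
print(round(score, 2))
```

The result agrees with the reported 50.82 to within a few hundredths; the small gap comes from the table rounding the brevity penalty to two decimals.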
|
148 |
+
|
149 |
|
150 |
## Translation models
|
151 |
|
152 |
+
The models `t5-small-24L-dutch-english` and `t5-base-36L-dutch-english-cased` have been fine-tuned for both language
|
153 |
+
directions on the first 25M samples from CCMatrix, giving a total of 50M training samples.
|
154 |
+
Evaluation is performed on out-of-sample CCMatrix and also on Tatoeba and Opus Books.
|
155 |
+
The `_bp` rows list the *brevity penalty*. The `avg_bleu` score is the bleu score
|
156 |
+
averaged over all three evaluation datasets. The best scores are displayed in bold for both translation directions.
|
157 |
+
|
158 |
+
| | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-base-36L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-base-36L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) | [t5-small-24L-ccmatrix-multi](https://huggingface.co/yhavinga/t5-small-24L-ccmatrix-multi) |
|
159 |
+
|:-----------------------|:-----------------------------|:-----------------------------|:------------------------------|:------------------------------|
|
160 |
+
| *source_lang* | en | nl | en | nl |
|
161 |
+
| *target_lang* | nl | en | nl | en |
|
162 |
+
| *source_prefix* | translate English to Dutch: | translate Dutch to English: | translate English to Dutch: | translate Dutch to English: |
|
163 |
+
| *ccmatrix_bleu* | 56.8 | 62.8 | **57.4** | **63.1** |
|
164 |
+
| *tatoeba_bleu* | **46.6** | **52.8** | 46.4 | 51.7 |
|
165 |
+
| *opus_books_bleu* | **13.5** | **24.9** | 12.9 | 23.4 |
|
166 |
+
| *ccmatrix_bp* | 0.95 | 0.96 | 0.95 | 0.96 |
|
167 |
+
| *tatoeba_bp* | 0.97 | 0.94 | 0.98 | 0.94 |
|
168 |
+
| *opus_books_bp* | 0.8 | 0.94 | 0.77 | 0.89 |
|
169 |
+
| *avg_bleu* | **38.96** | **46.86** | 38.92 | 46.06 |
|
170 |
+
| *max_source_length* | 128 | 128 | 128 | 128 |
|
171 |
+
| *max_target_length* | 128 | 128 | 128 | 128 |
|
172 |
+
| *adam_beta1* | 0.9 | 0.9 | 0.9 | 0.9 |
|
173 |
+
| *adam_beta2* | 0.997 | 0.997 | 0.997 | 0.997 |
|
174 |
+
| *weight_decay* | 0.05 | 0.05 | 0.002 | 0.002 |
|
175 |
+
| *lr* | 5e-05 | 5e-05 | 0.0005 | 0.0005 |
|
176 |
+
| *label_smoothing_factor* | 0.15 | 0.15 | 0.1 | 0.1 |
|
177 |
+
| *train_batch_size* | 128 | 128 | 128 | 128 |
|
178 |
+
| *warmup_steps* | 2000 | 2000 | 2000 | 2000 |
|
179 |
+
| *total steps* | 390625 | 390625 | 390625 | 390625 |
|
180 |
+
| *duration* | 4d 5h | 4d 5h | 3d 2h | 3d 2h |
|
181 |
+
| *num parameters* | 729M | 729M | 250M | 250M |
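
Two rows of the table above can be illustrated with a short sketch: the *total steps* value follows from the number of training samples and the batch size, and the *source_prefix* values are prepended to the input text to select the translation direction. The helper `build_model_input` is hypothetical, and the single space between prefix and text is an assumption.

```python
# Training set size: the first 25M CCMatrix samples, used in both directions.
samples_per_direction = 25_000_000
directions = 2
train_batch_size = 128

total_steps = (samples_per_direction * directions) // train_batch_size

# Prefixes as listed in the source_prefix row.
SOURCE_PREFIXES = {
    ("en", "nl"): "translate English to Dutch:",
    ("nl", "en"): "translate Dutch to English:",
}


def build_model_input(text: str, source_lang: str, target_lang: str) -> str:
    """Prepend the direction prefix; the separating space is an assumption."""
    return f"{SOURCE_PREFIXES[(source_lang, target_lang)]} {text}"


example = build_model_input("Good morning.", "en", "nl")
```

With these values `total_steps` equals the 390625 steps listed in the table, which suggests a single pass over the 50M samples.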
|
182 |
|
183 |
## Acknowledgements
|
184 |
|
185 |
This project would not have been possible without compute generously provided by Google through the
|
186 |
+
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was instrumental in all parts
|
187 |
+
of the training. Weights & Biases made it possible to keep track of many training sessions
|
188 |
+
and orchestrate hyper-parameter sweeps with insightful visualizations.
|
|
|
189 |
The following repositories were helpful in setting up the TPU-VM,
|
190 |
+
and in getting an idea of what sensible hyper-parameters are for training gpt2 from scratch:
|
191 |
|
192 |
* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
|
193 |
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
|
evaluation_t5_dutch_english.png
ADDED