Update README.md
README.md
CHANGED
---
language:
- nl
datasets:
- yhavinga/mc4_nl_cleaned
- ml6team/cnn_dailymail_nl
tags:
- seq2seq
- lm-head
license: apache-2.0
inference: false
---

# T5 v1.1 Large finetuned for CNN news summarization in Dutch 🇳🇱

This model is [t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) finetuned on [CNN Dailymail NL](https://huggingface.co/datasets/ml6team/cnn_dailymail_nl).

The inference widget on the right has been turned off. For a **demo** of the Dutch CNN summarization models, head over to the
Hugging Face Spaces for the **[Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer)** example application!
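
For local use, the snippet below is a minimal sketch of generating a summary with `transformers`. The model id `yhavinga/t5-v1.1-large-dutch-cnn-test`, the example article and the decoding settings are assumptions made for illustration, not values prescribed by this card.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed repository id for this fine-tuned summarization model.
model_name = "yhavinga/t5-v1.1-large-dutch-cnn-test"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Hier komt de volledige tekst van het nieuwsartikel dat samengevat moet worden."

# Inputs were truncated to 1024 tokens during fine-tuning (see the fine-tuning table below).
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    max_length=96,      # matches the target length used during fine-tuning
    num_beams=4,        # assumed decoding setting
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```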

## Tokenizer

* Tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the Huggingface
Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
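
As a rough sketch of that procedure with the `tokenizers` library (the example scripts ship their own SentencePiece Unigram trainer; the vocabulary size and special tokens below are assumptions, and T5 sentinel tokens are omitted for brevity):

```python
from datasets import load_dataset
from tokenizers import SentencePieceUnigramTokenizer

# Stream the cleaned Dutch mC4 corpus; the "full" config name is assumed here.
dataset = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)

def batch_iterator(batch_size=1000):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Train a SentencePiece Unigram tokenizer from scratch on the Dutch text.
tokenizer = SentencePieceUnigramTokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=32_000,                      # assumed vocabulary size
    special_tokens=["<pad>", "</s>", "<unk>"],
)
tokenizer.save("dutch-t5-tokenizer.json")
```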

## Dataset

All models listed below are trained on the `full` configuration (39B tokens) of
[cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
which is the original mC4, except that (a Python sketch of these filters follows the list):

* Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
* Sentences with fewer than 3 words are removed
* Sentences containing a word of more than 1000 characters are removed
* Documents with fewer than 5 sentences are removed
* Documents containing "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
"use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.

## Models

* The first model, `t5-base-dutch`, is a re-training of the Dutch T5 base v1.0 model trained during the Flax/Jax community
week. With training complete, accuracy improved from 0,64 to 0,70.
* The next two models are an uncased and a cased version of `t5-v1.1-base`, again pre-trained from scratch on Dutch,
with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 v1.0 models, and the
base models are trained with a dropout of 0.0. For fine-tuning, dropout should be set back to 0.1 (see the sketch after this list).
* The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training of t5-v1.1-large proved difficult.
Without dropout regularization, training would diverge at a certain point. With dropout, training went better, albeit
much slower than training the t5 model. At some point convergence was too slow to warrant further training.
The latest checkpoint, training scripts and metrics are available for reference. For actual fine-tuning the cased
base model is probably the better choice.
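
A minimal sketch of re-enabling dropout when fine-tuning one of the v1.1 checkpoints (the repository id below is assumed; overriding `dropout_rate` through `from_pretrained` is one way to do this, editing the saved config is another):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "yhavinga/t5-v1.1-base-dutch-cased"  # assumed repository id

# The v1.1 checkpoints were pre-trained with dropout_rate=0.0;
# set it back to 0.1 for fine-tuning, as recommended above.
model = T5ForConditionalGeneration.from_pretrained(model_name, dropout_rate=0.1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```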

|                            | model   | train seq len | acc      | loss     | batch size | epochs | steps   | dropout | optim     | lr   | duration |
|----------------------------|---------|---------------|----------|----------|------------|--------|---------|---------|-----------|------|----------|
| t5-base-dutch              | T5      | 512           | 0,70     | 1,38     | 128        | 1      | 528481  | 0.1     | adafactor | 5e-3 | 2d 9h    |
| t5-v1.1-base-dutch-uncased | t5-v1.1 | 1024          | 0,73     | 1,20     | 64         | 2      | 1014525 | 0.0     | adafactor | 5e-3 | 5d 5h    |
| t5-v1.1-base-dutch-cased   | t5-v1.1 | 1024          | **0,78** | **0,96** | 64         | 2      | 1210000 | 0.0     | adafactor | 5e-3 | 6d 6h    |
| t5-v1.1-large-dutch-cased  | t5-v1.1 | 512           | 0,76     | 1,07     | 64         | 1      | 1120000 | 0.1     | adafactor | 5e-3 | 86 13h   |

The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.

|                              | model   | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
|------------------------------|---------|-----------|------------|--------|--------|--------|-----------|--------------|--------|------------|-------|----------|
| t5-v1.1-base-dutch-cnn-test  | t5-v1.1 | 1024      | 96         | 34,8   | 13,6   | 25,2   | 32,1      | 79           | 6      | 64         | 26916 | 2h 40m   |
| t5-v1.1-large-dutch-cnn-test | t5-v1.1 | 1024      | 96         | 34,4   | 13,6   | 25,3   | 31,7      | 81           | 5      | 16         | 89720 | 11h      |
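
As a sketch of how such a fine-tuning run can be prepared with `datasets` and `transformers` (the column names `article` and `highlights` and the choice of tokenizer are assumptions; only the 1024/96 sequence lengths come from the table above):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-v1.1-large-dutch-cased")
dataset = load_dataset("ml6team/cnn_dailymail_nl")

max_input_length = 1024   # input length used for the CNN fine-tuning runs
max_target_length = 96    # target length used for the CNN fine-tuning runs

def preprocess(batch):
    # "article" and "highlights" follow the original CNN/DailyMail column layout
    # and are assumed to carry over to the Dutch translation.
    model_inputs = tokenizer(
        batch["article"], max_length=max_input_length, truncation=True
    )
    # text_target tokenization requires a recent transformers release.
    labels = tokenizer(
        text_target=batch["highlights"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)
```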

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
instrumental in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM,
and getting an idea of sensible hyper-parameters for training T5 from scratch.

* [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
* [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)