yhavinga committed on
Commit
f420234
1 Parent(s): 49928d0

Update README.md

Files changed (1)
  1. README.md +61 -27
README.md CHANGED
@@ -6,42 +6,76 @@ widget:
  - text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
  - text: "In Israël was een strenge lockdown"
  tags:
- - gpt-neo-125M
- - gpt-neo
- - text generation
- - pytorch
- - causal-lm
+ - gpt2-medium
+ - gpt2
  pipeline_tag: text-generation
  datasets:
  - yhavinga/mc4_nl_cleaned
  ---
- # # GPT-Neo 125M pre-trained on cleaned Dutch mC4 🇳🇱
-
- Dataset:
-
- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: mc4 nl filtered with only newspapers and wikipedia
- * total tokens: 3.9B
-
- Tokenizer:
-
- * Tokenizer trained on mC4 with scripts from the Huggingface
- Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
-
- Training details:
-
- * Trained for 558608 steps with batch size 128
- * Optimizer: AdamW
- * Block size: 512
- * Learning rate: 2.4e-3
- * Warmup steps: 5000
- * Epochs: 8
-
- Jan 2022
-
- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Thanks to @gsarti for creating the [t5-flax-gcp
- repository](https://github.com/gsarti/t5-flax-gcp).
- * Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
- [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)
- for sharing their training scripts!
+ # GPT-Neo 125M pre-trained on cleaned Dutch mC4 🇳🇱
+
+ A GPT-Neo small model (125M parameters) trained from scratch on Dutch, with perplexity 19.9 on cleaned Dutch mC4.
+
+ ## How To Use
+
+ You can use this GPT-Neo model directly with a pipeline for text generation.
+
+ ```python
+ from transformers import pipeline, GPT2Tokenizer, GPTNeoForCausalLM
+
+ MODEL_DIR = 'yhavinga/gpt-neo-125M-dutch'
+ tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
+ model = GPTNeoForCausalLM.from_pretrained(MODEL_DIR)
+ generator = pipeline('text-generation', model, tokenizer=tokenizer)
+
+ generated_text = generator('Wetenschappers verbonden aan de Katholieke Universiteit',
+                            max_length=100, do_sample=True, top_k=40, top_p=0.95,
+                            repetition_penalty=2.0)
+ ```
+
+ *"Wetenschappers verbonden aan de Katholieke Universiteit van Nijmegen" - "hebben er in het laatste nummer dat deze week verschijnt nog niets over gezegd. De wetenschappers verwachten pas volgend jaar meer duidelijkheid te kunnen geven, zo blijkt uit onderzoek door een vakblad en op Facebook onder studenten die denken mee te moeten werken om hun studie af te maken.
+ In augustus 2017 kwam al naar buiten wat eraan schortten: hogescholen zouden moeite hebben met excel-software, ze hadden niet voldoende tijd om alle"*
+
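+ The pipeline returns a list with one dictionary per generated sequence. A minimal sketch, assuming the standard `text-generation` pipeline output format, for printing the generated string:
+
+ ```python
+ # Each dict holds the full text (prompt + continuation) under 'generated_text'.
+ for result in generated_text:
+     print(result['generated_text'])
+ ```
+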
+ ## Tokenizer
+
+ * BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the HuggingFace
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
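+
+ A quick way to see the tokenizer at work is to load it and split a Dutch sentence into BPE pieces; a minimal sketch, assuming only the hosted tokenizer files:
+
+ ```python
+ from transformers import GPT2Tokenizer
+
+ tokenizer = GPT2Tokenizer.from_pretrained('yhavinga/gpt-neo-125M-dutch')
+ # Show the BPE pieces and the corresponding token ids for a Dutch sentence.
+ sentence = 'In Israël was een strenge lockdown'
+ print(tokenizer.tokenize(sentence))
+ print(tokenizer(sentence)['input_ids'])
+ print('vocabulary size:', tokenizer.vocab_size)
+ ```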
+
+ ## Dataset
+
+ This model was trained on the Wikipedia and newspaper web pages (3.9B tokens) in
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4 except that the following filters were applied (roughly sketched in code after the list):
+
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences with a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+
+ ## Models
+
+ TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.
+
+ * `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of cleaned Dutch mC4 containing only Wikipedia and news sites.
+ * The models with `a`/`b` in the steps column have been trained to step `a` of a total of `b` steps.
+
+ | | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
+ |-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
+ | [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M | 512 | 19.9 | 2.99 | 128 | 8 | 558608 | adamw | 2.4e-3 | 1d 12h | news+wiki |
+ | [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 4 | 320000/520502 | adafactor | 8e-4 | 7d 2h | full |
+ | [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch) | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
+ | [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
+
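+ The ppl and loss columns are consistent: perplexity is the exponential of the reported loss (up to rounding). A one-line check for this model, purely illustrative:
+
+ ```python
+ import math
+
+ # exp(cross-entropy loss) gives the perplexity reported above for gpt-neo-125M-dutch.
+ print(f"perplexity ≈ {math.exp(2.99):.1f}")  # -> 19.9
+ ```
+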
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+ instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and training the models:
+
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+ * [HuggingFace Flax language-modeling examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
+
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)