yhavinga committed on
Commit
f420234
1 Parent(s): 49928d0

Update README.md

Files changed (1)
  1. README.md +61 -27
README.md CHANGED
@@ -6,42 +6,76 @@ widget:
  - text: "Studenten en leraren van de Bogazici Universiteit in de Turkse stad Istanbul"
  - text: "In Israël was een strenge lockdown"
  tags:
- - gpt-neo-125M
- - gpt-neo
- - text generation
- - pytorch
- - causal-lm
+ - gpt2-medium
+ - gpt2
  pipeline_tag: text-generation
  datasets:
  - yhavinga/mc4_nl_cleaned
  ---
- # # GPT-Neo 125M pre-trained on cleaned Dutch mC4 🇳🇱
-
- Dataset:
-
- * [mC4 NL Cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned)
- * dataset config: mc4 nl filtered with only newspapers and wikipedia
- * total tokens: 3.9B
-
- Tokenizer:
-
- * Tokenizer trained on mC4 with scripts from the Huggingface
- Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
-
- Training details:
-
- * Trained for 558608 steps with batch size 128
- * Optimizer: AdamW
- * Block size: 512
- * Learning rate: 2.4e-3
- * Warmup steps: 5000
- * Epochs: 8
-
- Jan 2022
-
- * Many thanks to the [Google TPU Research Cloud](https://sites.research.google/trc/about/) for providing access to a TPU cluster!
- * Thanks to @gsarti for creating the [t5-flax-gcp
- repository](https://github.com/gsarti/t5-flax-gcp).
- * Also thanks to the creators of [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian) and
- [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-persian)
- for sharing their training scripts!
+ # GPT-Neo 125M pre-trained on cleaned Dutch mC4 🇳🇱
+
+ A GPT-Neo small model (125M parameters) trained from scratch on Dutch, with perplexity 19.9 on cleaned Dutch mC4.
+
+ ## How To Use
+
+ You can use this GPT-Neo model directly with a pipeline for text generation.
+
+ ```python
+ from transformers import pipeline, GPT2Tokenizer, GPTNeoForCausalLM
+
+ MODEL_DIR = 'yhavinga/gpt-neo-125M-dutch'
+ tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
+ model = GPTNeoForCausalLM.from_pretrained(MODEL_DIR)
+ generator = pipeline('text-generation', model, tokenizer=tokenizer)
+
+ generated_text = generator('Wetenschappers verbonden aan de Katholieke Universiteit',
+                            max_length=100, do_sample=True, top_k=40, top_p=0.95,
+                            repetition_penalty=2.0)
+ ```
+
+ *"Wetenschappers verbonden aan de Katholieke Universiteit van Nijmegen" - "hebben er in het laatste nummer dat deze week verschijnt nog niets over gezegd. De wetenschappers verwachten pas volgend jaar meer duidelijkheid te kunnen geven, zo blijkt uit onderzoek door een vakblad en op Facebook onder studenten die denken mee te moeten werken om hun studie af te maken.
+ In augustus 2017 kwam al naar buiten wat eraan schortten: hogescholen zouden moeite hebben met excel-software, ze hadden niet voldoende tijd om alle"*
+
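+ The pipeline returns a list with one dictionary per generated sequence. A minimal sketch, assuming the standard `text-generation` pipeline output format, for printing the generated string:
+
+ ```python
+ # Each dict holds the full text (prompt + continuation) under 'generated_text'.
+ for result in generated_text:
+     print(result['generated_text'])
+ ```
+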
+ ## Tokenizer
+
+ * BPE tokenizer trained from scratch for Dutch on mC4 nl cleaned with scripts from the HuggingFace
+ Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
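+
+ A quick way to see the tokenizer at work is to load it and split a Dutch sentence into BPE pieces; a minimal sketch, assuming only the hosted tokenizer files:
+
+ ```python
+ from transformers import GPT2Tokenizer
+
+ tokenizer = GPT2Tokenizer.from_pretrained('yhavinga/gpt-neo-125M-dutch')
+ # Show the BPE pieces and the corresponding token ids for a Dutch sentence.
+ sentence = 'In Israël was een strenge lockdown'
+ print(tokenizer.tokenize(sentence))
+ print(tokenizer(sentence)['input_ids'])
+ print('vocabulary size:', tokenizer.vocab_size)
+ ```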
+
+ ## Dataset
+
+ This model was trained on the Wikipedia and newspaper web pages (3.9B tokens) in
+ [cleaned Dutch mC4](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned),
+ which is the original mC4 except that the following filters were applied (roughly sketched in code after the list):
+
+ * Documents that contained words from a selection of the Dutch and English [List of Dirty Naughty Obscene and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) are removed
+ * Sentences with fewer than 3 words are removed
+ * Sentences with a word of more than 1000 characters are removed
+ * Documents with fewer than 5 sentences are removed
+ * Documents with "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies",
+ "use of cookies", "use cookies", "elementen ontbreken", "deze printversie" are removed.
+
+ ## Models
+
+ TL;DR: [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) is the best model.
+
+ * `yhavinga/gpt-neo-125M-dutch` is trained on a fraction of cleaned Dutch mC4 containing only Wikipedia and news sites.
+ * The models with `a`/`b` in the steps column have been trained to step `a` of a total of `b` steps.
+
+ | | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
+ |-----------------------------------------------------------------------------------|---------|--------|---------------|------|------|------------|--------|-----------------|-----------|--------|----------|-----------|
+ | [yhavinga/gpt-neo-125M-dutch](https://huggingface.co/yhavinga/gpt-neo-125M-dutch) | gpt neo | 125M | 512 | 19.9 | 2.99 | 128 | 8 | 558608 | adamw | 2.4e-3 | 1d 12h | news+wiki |
+ | [yhavinga/gpt2-medium-dutch](https://huggingface.co/yhavinga/gpt2-medium-dutch) | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 4 | 320000/520502 | adafactor | 8e-4 | 7d 2h | full |
+ | [yhavinga/gpt2-large-dutch](https://huggingface.co/yhavinga/gpt2-large-dutch) | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
+ | [yhavinga/gpt-neo-1.3B-dutch](https://huggingface.co/yhavinga/gpt-neo-1.3B-dutch) | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
+
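+ The ppl and loss columns are consistent: perplexity is the exponential of the reported loss (up to rounding). A one-line check for this model, purely illustrative:
+
+ ```python
+ import math
+
+ # exp(cross-entropy loss) gives the perplexity reported above for gpt-neo-125M-dutch.
+ print(f"perplexity ≈ {math.exp(2.99):.1f}")  # -> 19.9
+ ```
+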
+ ## Acknowledgements
+
+ This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/). The HuggingFace 🤗 ecosystem was also
+ instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM
+ and training the models:
+
+ * [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
+ * [HuggingFace Flax language-modeling examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
+ * [gpt2-medium-persian](https://huggingface.co/flax-community/gpt2-medium-persian)
+ * [gpt2-medium-indonesian](https://huggingface.co/flax-community/gpt2-medium-indonesian)
+
+ Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)