benjamin committed
Commit ab8568f · 1 Parent(s): f246c8a

Update README.md

Files changed (1): README.md (+66 -16)

README.md CHANGED

license: mit
---

# GerPT2-large

German large and small versions of GPT2:

- https://huggingface.co/benjamin/gerpt2
- https://huggingface.co/benjamin/gerpt2-large
 
See the [GPT2 model card](https://huggingface.co/gpt2) for considerations on limitations and bias. See the [GPT2 documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for details on GPT2.

## Comparison to [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2)

I evaluated both GerPT2-large and the other German GPT2, [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2), on the [CC-100](http://data.statmt.org/cc-100/) dataset and on the German Wikipedia:
 
| Model             | CC-100 (PPL) | Wikipedia (PPL) |
|-------------------|--------------|-----------------|
| dbmdz/german-gpt2 | 49.47        | 62.92           |
| GerPT2            | 24.78        | 35.33           |
| GerPT2-large      | **16.08**    | **23.26**       |
 
See the script `evaluate.py` in the [GerPT2 Github repository](https://github.com/bminixhofer/gerpt2) for the code.
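
As a rough sketch of how such a perplexity figure can be computed with the `transformers` API (the input file and chunk size here are placeholder assumptions, not necessarily what `evaluate.py` does):

```python
# Illustrative chunked perplexity estimate; `evaluate.py` in the GerPT2
# repository is the authoritative implementation.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")
model.eval()

text = open("eval_sample.de.txt", encoding="utf-8").read()  # placeholder file
ids = tokenizer.encode(text)

chunk_size = 1024  # GPT2's context window
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, len(ids), chunk_size):
        chunk = torch.tensor(ids[start : start + chunk_size]).unsqueeze(0)
        if chunk.size(1) < 2:
            break
        # With labels supplied, the model returns the mean cross-entropy over
        # the predicted tokens; weight it by the number of predictions.
        loss = model(chunk, labels=chunk).loss
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```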
 
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = "<your prompt>"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
```
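
The pipeline forwards generation keyword arguments to `generate`, so sampling settings can be tuned directly in the call; the values below are only illustrative:

```python
# Illustrative sampling settings; adjust them for your use case.
print(pipe(prompt, max_length=100, do_sample=True, top_k=50, top_p=0.95)[0]["generated_text"])
```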
 
Also, two tricks might improve the generated text:

```python
import torch

# choose the maximum overall output length for this example
max_length = 100

output = model.generate(
    # during training an EOS token was used to mark the beginning of each text,
    # so it can help to insert it at the start
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # try setting bad_words_ids=[[0]] to disallow generating an EOS token; without this the model is
    # prone to ending generation early because a significant number of texts from the training corpus
    # are quite short
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
```

## Training details

GerPT2-large was trained on the entire German data (67GB) from the [CC-100 Corpus](http://data.statmt.org/cc-100/), with weights initialized from the [English GPT2 model](https://huggingface.co/gpt2-large).

GerPT2-large was trained with:

- a batch size of 256
- a OneCycle learning rate schedule with a maximum of 5e-3
- AdamW with a weight decay of 0.01
- 2 epochs
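
For orientation, here is a minimal sketch of that optimizer and schedule in plain PyTorch; the step count is a placeholder, and the actual run used the repository's training script on TPUs:

```python
# Sketch of the stated hyperparameters: AdamW with weight decay 0.01 and a
# OneCycle learning rate schedule peaking at 5e-3. total_steps is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=5e-3,
    total_steps=10_000,  # placeholder; in practice steps_per_epoch * num_epochs
)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```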

Training took roughly 12 days on 8 TPUv3 cores.

To train GerPT2-large, follow these steps. Scripts are located in the [Github repository](https://github.com/bminixhofer/gerpt2):
 
0. Download and unzip training data from http://data.statmt.org/cc-100/.
1. Train a tokenizer using `prepare/train_tokenizer.py`. As training data for the tokenizer I used a random subset of 5% of the CC-100 data (a rough sketch of this step follows the list).
2. (optionally) generate a German input embedding matrix with `prepare/generate_aligned_wte.py`. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings (the idea is sketched after the list), e.g.:

    ```
    ĠMinde -> Ġleast
    Ġjed -> Ġwhatsoever
    flughafen -> Air
    vermittlung -> employment
    teilung -> ignment
    ĠInterpretation -> Ġinterpretation
    Ġimport -> Ġimported
    hansa -> irl
    genehmigungen -> exempt
    ĠAuflist -> Ġlists
    Ġverschwunden -> Ġdisappeared
    ĠFlyers -> ĠFlyers
    Kanal -> Channel
    Ġlehr -> Ġteachers
    Ġnahelie -> Ġconvenient
    gener -> Generally
    mitarbeiter -> staff
    ```

    This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it via `wte_path` to the training script. Credit to [this blogpost](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787) for the idea of initializing GPT2 from English weights.

3. Tokenize the corpus using `prepare/tokenize_text.py`. This generates files for train and validation tokens in JSON Lines format.
4. Run the training script `train.py`! `run.sh` shows how this was executed for the full run with config `configs/tpu_large.json`.
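
For step 1, a minimal sketch of tokenizer training with the `tokenizers` library could look like the following; the file name, vocabulary size and special tokens are assumptions, and `prepare/train_tokenizer.py` is the authoritative script:

```python
# Sketch: train a byte-level BPE tokenizer on a sample of the German CC-100 data.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["cc100_de_sample.txt"],   # placeholder: ~5% random subset of CC-100
    vocab_size=50_257,               # assumed to mirror GPT2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("gerpt2-large-tokenizer")
```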
 
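For step 2, the aligned-embedding idea can be illustrated with toy data: map each German token to the English token whose aligned word vector is most similar, then copy that token's GPT2 input embedding. The vectors below are fake placeholders, and the real `prepare/generate_aligned_wte.py` may differ in detail:

```python
# Toy sketch of the aligned-WTE initialization idea (placeholder data throughout).
import numpy as np

# Placeholder aligned word embeddings living in one shared cross-lingual space.
german_vectors = {"flughafen": np.array([0.9, 0.1, 0.0]), "kanal": np.array([0.1, 0.9, 0.0])}
english_vectors = {"airport": np.array([0.88, 0.12, 0.0]), "channel": np.array([0.12, 0.85, 0.1])}

# Placeholder English GPT2 input embeddings for the corresponding tokens.
english_wte = {"airport": np.random.rand(768), "channel": np.random.rand(768)}

def nearest_english(word_vec):
    # Cosine similarity against every English aligned vector.
    best, best_sim = None, -1.0
    for word, vec in english_vectors.items():
        sim = float(vec @ word_vec / (np.linalg.norm(vec) * np.linalg.norm(word_vec) + 1e-8))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Initialize each German token's embedding from its nearest English neighbour.
german_wte = {de: english_wte[nearest_english(vec)] for de, vec in german_vectors.items()}
print({de: nearest_english(vec) for de, vec in german_vectors.items()})
# -> {'flughafen': 'airport', 'kanal': 'channel'}
```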

## License

GerPT2-large is licensed under the MIT License.

## Acknowledgements

Thanks to [Hugging Face](https://huggingface.co) for awesome tools and infrastructure.

Huge thanks to [Artus Krohn-Grimberghe](https://twitter.com/artuskg) at [LYTiQ](https://www.lytiq.de/) for making this possible by sponsoring the resources used for training.