readme: add initial version

README.md (ADDED)

---
language: de

widget:
- text: "Heute ist sehr schönes Wetter in"

license: mit
---

# German GPT-2 model

In this repository we release (yet another) GPT-2 model that was trained on various German texts.

The model is meant to be an entry point for fine-tuning on other German texts, and it is definitely not as good or as "dangerous" as the English GPT-3 model. We do not plan extensive PR or staged releases for this model 😉

**Note**: The model was initially released under an anonymous alias (`anonymous-german-nlp/german-gpt2`), so we now "de-anonymize" it ;)

More details about GPT-2 can be found in the great [Hugging Face](https://huggingface.co/transformers/model_doc/gpt2.html) documentation.

# Changelog

15.11.2020: Initial release.

# Training corpora

We use pretty much the same corpora as were used for training the DBMDZ BERT model, which can be found in [this repository](https://github.com/dbmdz/berts).

Thanks to the awesome Hugging Face team, it is possible to create byte-level BPE tokenizers with their [Tokenizers](https://github.com/huggingface/tokenizers) library. With it, we created a 52K byte-level BPE vocab based on the training corpora.

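The exact tokenizer training script is not part of this card, but a minimal sketch of building such a 52K byte-level BPE vocab with the Tokenizers library could look like this (the corpus file names are placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder corpus files; the real training corpora are described above.
files = ["german_corpus_part_01.txt", "german_corpus_part_02.txt"]

# Train a byte-level BPE tokenizer with a 52K vocab.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=files, vocab_size=52_000, special_tokens=["<|endoftext|>"])

# Writes vocab.json and merges.txt to the current directory.
tokenizer.save_model(".")
```
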
After creating the vocab, we trained the German GPT-2 model on one TPU over the complete training corpus for three epochs.

# Using the model

The model itself can be used in this way:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")
```

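The snippet above only loads the model; as a quick, hedged sketch (prompt and sampling parameters are illustrative, not from the original card), the loaded model can generate text directly via `generate`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# Encode a German prompt and sample a short continuation.
input_ids = tokenizer.encode("Heute ist sehr schönes Wetter in", return_tensors="pt")
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50, top_p=0.95)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
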
However, the great Transformers *Pipelines* make text generation even more convenient, so here's an example:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="dbmdz/german-gpt2",
                tokenizer="dbmdz/german-gpt2")

text = pipe("Der Sinn des Lebens ist es", max_length=800)[0]["generated_text"]

print(text)
```

This could output this beautiful text:

```
Der Sinn des Lebens ist es, im Geist zu verweilen, aber nicht in der Welt zu sein, sondern ganz im Geist zu leben.
Die Menschen beginnen, sich nicht nach der Natur und nach der Welt zu richten, sondern nach der Seele,'
```

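Note that sampling is nondeterministic, so the generated text will differ from run to run. A small sketch (the seed value is arbitrary, not part of the original card) of making runs reproducible with `set_seed`:

```python
from transformers import pipeline, set_seed

pipe = pipeline("text-generation", model="dbmdz/german-gpt2")

set_seed(42)  # fix the relevant RNGs so repeated runs sample the same text
print(pipe("Der Sinn des Lebens ist es", max_length=100)[0]["generated_text"])
```
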
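And since the model is explicitly meant as an entry point for fine-tuning, here is a rough, hedged sketch of continued training on your own texts with the Trainer API; `my_corpus.txt`, the output directory and all hyperparameters are placeholders, not the original training setup:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")

# "my_corpus.txt" is a placeholder for your own German training text.
# TextDataset concatenates the corpus and cuts it into fixed-size blocks.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="my_corpus.txt",
                            block_size=128)

# mlm=False gives the causal (GPT-2 style) language modeling objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="german-gpt2-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
```
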
# License

All models are licensed under [MIT](LICENSE).

# Hugging Face model hub

All models are available on the [Hugging Face model hub](https://huggingface.co/dbmdz).

# Contact (Bugs, Feedback, Contribution and more)

For questions about our German GPT-2 model just open an issue
[here](https://github.com/stefan-it/german-gpt/issues/new) 🤗

# Acknowledgments

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
Thanks for providing access to the TFRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download this model from their S3 storage 🤗