Icelandic GPT-2 model

This Icelandic GPT-2 language model was pretrained on the Icelandic Gigaword Corpus (IGC, 2020 version), which contains approximately 1.532 billion running words. The model has roughly 138M parameters and uses a byte-level BPE tokenizer with a vocabulary size of 51,000. It was trained for 20 epochs on a TPU v3-8, with a total training time of 3 days and 21 hours. The hyperparameters used for training can be found in the JAX/Flax documentation for the Transformers library.

Note: This model was pretrained on a tokenized and sentence-segmented version of the IGC, which is reflected in the text it generates. A new version of this model, trained on an untokenized version of the IGC (2022 version), is forthcoming.
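
The model can be loaded and queried with the Transformers library in the usual way. The snippet below is a minimal sketch: the repository id, the prompt, and the generation settings are placeholders, not values taken from this model card.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Placeholder repository id; replace with this model's actual id on the Hugging Face Hub.
model_id = "your-username/icelandic-gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_id)    # byte-level BPE, 51,000-token vocabulary
model = AutoModelForCausalLM.from_pretrained(model_id) # ~138M-parameter GPT-2 architecture

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Ísland er"  # example Icelandic prompt: "Iceland is"
print(generator(prompt, max_new_tokens=30, do_sample=True, top_k=50)[0]["generated_text"])
```

Because the current model was trained on tokenized, sentence-segmented text, the generated output may show tokenization artifacts such as extra spaces around punctuation.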

Acknowledgments

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC).
