Update README.md
README.md
CHANGED
@@ -20,7 +20,7 @@ datasets:
 
 <img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
 
-NorMistral-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion tokens (using six repetitions of open Norwegian texts).
+NorMistral-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
 
 This model is a part of the NORA-LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project team](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
 All the models are pre-trained on the same dataset and with the same tokenizer.
@@ -40,10 +40,9 @@ _____
 ## Pretraining corpus
 
 The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
-This resulted in over 34B tokens of Norwegian (Bokmål or Nynorsk) in total.
+This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
 We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
-The
-
+The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
 
 
 _____
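The token figures in the changed lines can be cross-checked with a little arithmetic. The sketch below is not part of the commit. It assumes that the 20% code share is taken out of the 260B total first and that the remainder is natural language repeated six times, which is one plausible reading of the README text rather than something it states outright. All names are hypothetical.

```python
# Figures quoted in the README diff (B = billions of tokens).
TOTAL_BUDGET_B = 260.0          # total pretraining budget
CODE_FRACTION = 0.20            # share sampled from the Starcoder code corpus
REPETITIONS = 6                 # repetitions of the natural language data
NORWEGIAN_SUBWORD_B = 34.0      # "over 34B" subword tokens, so a lower bound
NORWEGIAN_WHITESPACE_B = 26.7   # "about 26.7B" whitespace-separated tokens


def token_budget():
    """Break the budget down under the assumptions stated above."""
    code_b = TOTAL_BUDGET_B * CODE_FRACTION
    # Assumption: everything that is not code is natural language.
    natural_b = TOTAL_BUDGET_B - code_b
    # Unique natural language tokens needed to fill that share in six passes.
    unique_needed_b = natural_b / REPETITIONS
    # What six passes over the quoted lower bound of 34B would yield.
    repeated_lower_bound_b = NORWEGIAN_SUBWORD_B * REPETITIONS
    # Average number of subword tokens per whitespace-separated token.
    subwords_per_word = NORWEGIAN_SUBWORD_B / NORWEGIAN_WHITESPACE_B
    return {
        "code_b": code_b,
        "natural_b": natural_b,
        "unique_needed_b": unique_needed_b,
        "repeated_lower_bound_b": repeated_lower_bound_b,
        "subwords_per_word": subwords_per_word,
    }


if __name__ == "__main__":
    for name, value in token_budget().items():
        print(f"{name}: {value:.2f}")
```

Under this reading, six passes over 34B tokens give 204B, slightly below the 208B natural language share. That is consistent with the README saying "over 34B", since about 34.7B unique tokens would fill the share exactly.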