ltgoslo committed on
Commit 8eceaae
1 Parent(s): aca8cfd

Update README.md

Files changed (1)
  1. README.md +3 -4
README.md CHANGED
@@ -20,7 +20,7 @@ datasets:
 
 <img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
 
- NorMistral-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion tokens (using six repetitions of open Norwegian texts).
+ NorMistral-7b-scratch is a large Norwegian language model pretrained from scratch on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
 
 This model is a part of the NORA-LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project team](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
 All the models are pre-trained on the same dataset and with the same tokenizer.
@@ -40,10 +40,9 @@ _____
 ## Pretraining corpus
 
 The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
- This resulted in over 34B tokens of Norwegian (Bokmål or Nynorsk) in total.
+ This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
 We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
- The Norwegian data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
-
+ The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
 
 
 _____
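
A quick arithmetic check of the corpus figures in the updated passage above, as a minimal sketch only: the exact sampling split is inferred from the README text (20% code, six repetitions of the natural-language data), not stated as an explicit formula there.

```python
# Rough consistency check of the token-budget figures quoted in the README diff.
# Assumption (inferred, not stated outright): the six-fold repetition applies to
# the non-code (natural-language) share of the 260B-token budget.

total_budget = 260e9                         # total pretraining budget, subword tokens
code_share = 0.20                            # "20% of the 260B tokens" come from Starcoder
code_tokens = code_share * total_budget      # ~52B code tokens
natural_tokens = total_budget - code_tokens  # ~208B natural-language tokens seen in training
repetitions = 6                              # natural-language data is repeated six times
unique_natural = natural_tokens / repetitions

print(f"code tokens:           {code_tokens / 1e9:.0f}B")
print(f"unique natural tokens: {unique_natural / 1e9:.1f}B")
# -> ~52B code and ~34.7B unique natural-language tokens, which lines up with the
#    "over 34B subword tokens" of Norwegian quoted in the corpus description.
```

Under the same figures, the quoted 26.7B whitespace-separated tokens would correspond to roughly 1.3 subword tokens per whitespace-separated word for this corpus.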