rdiehlmartinez committed
Commit ae00997 · verified · 1 Parent(s): a547025

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -40,20 +40,21 @@ Each model includes:
 
 ### Available Datasets
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
-- 420B tokens of pre-processed, tokenized and shuffled text extraced from the **DOLMA**[https://allenai.org/dolma] corpus
+- 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
 - We use this dataset to train our model suite
 
 2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
 - A smaller version of the **pretokenized-dolma** corpus for quick experiments
 
 3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
-- A tokenized and shuffled version of the **Paloma**[vhttps://allenai.org/evaluation-frameworks] evaluation corpus
+- A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
 - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus
 - We use this corpus to evaluate the perplexity of our models
 
 4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
 - A sub-sampled version of the **pretokenized-paloma** corpus
 
+All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
 
 ## 🔧 GitHub Training Framework
 
@@ -116,8 +117,7 @@ Standard configuration (customizable in GitHub training):
 - Weight decay: 0.1
 - Gradient clipping: 1.0
 - Mixed precision training
-- Vocab size: 50280 (using the **[OLMo Tokenizer]**(https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json))
-
+- Vocab size: 50280
 
 ## 🔬 Research Applications
 
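For readers of the datasets hunk above: the pretokenized corpora are hosted as ordinary Hugging Face datasets, so they can be streamed without downloading all 420B tokens. A minimal sketch, assuming the standard `datasets` API; the split name and record schema are assumptions not stated in this README, so the code inspects the fields rather than hard-coding them.

```python
# Minimal sketch: stream the pretokenized training corpus named in the diff.
# Assumptions: a "train" split exists and records hold pre-computed token ids.
from datasets import load_dataset

stream = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)
example = next(iter(stream))
print(example.keys())  # inspect the actual schema before relying on field names
```

The same call should work for the `-tiny` and Paloma variants by swapping in the corresponding dataset id.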
 
 
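The Paloma bullets mention perplexity evaluation. A hedged sketch of that computation, assuming a `transformers`-style causal LM whose forward pass returns a `.loss` when `labels` are supplied; the page names no specific model, so it is left abstract here.

```python
# Sketch: sequence perplexity = exp(mean next-token cross-entropy).
import math
import torch

def perplexity(model, input_ids: torch.Tensor) -> float:
    # Passing labels=input_ids makes a causal LM return the shifted
    # next-token cross-entropy averaged over the sequence.
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```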
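The commit also moves the tokenizer note out of the hyperparameter list and up into the datasets section. A quick, hedged check of the quoted vocabulary size against the linked tokenizer; note that `vocab_size` can exclude added special tokens, so both counts are printed.

```python
# Sketch: load the OLMo tokenizer linked in the diff and compare its size
# with the "Vocab size: 50280" line from the training configuration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")
print(tokenizer.vocab_size, len(tokenizer))  # expect values in line with 50280
```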
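The second hunk trims the vocab-size bullet but leaves the rest of the standard configuration (weight decay 0.1, gradient clipping 1.0, mixed precision) unchanged. A hedged sketch of how those three settings typically combine in a PyTorch training step; the AdamW optimizer, learning rate, and stand-in model are assumptions, not the actual Pico training loop from the GitHub framework.

```python
# Sketch of a mixed-precision training step using the README's settings:
# weight decay 0.1, gradient clipping 1.0, mixed precision via AMP.
import torch

model = torch.nn.Linear(512, 50280)  # hypothetical stand-in with a vocab-sized head
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for mixed precision

def training_step(inputs: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # mixed-precision forward pass
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(logits, labels)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping: 1.0
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```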