Update README.md
README.md (CHANGED)
```diff
@@ -40,20 +40,21 @@ Each model includes:
 
 ### Available Datasets
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
-    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **DOLMA
+    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
     - We use this dataset to train our model suite
 
 2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
     - A smaller version of the **pretokenized-dolma** corpus for quick experiments
 
 3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
-    - A tokenized and shuffled version of the **Paloma
+    - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
     - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus and provides
     - We use this corpus to evaluate the perplexity of our models
 
 4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
     - A sub-sampled version of the **pretokenized-dolma** corpus
 
+All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
 
 ## 🔧 GitHub Training Framework
 
@@ -116,8 +117,7 @@ Standard configuration (customizable in GitHub training):
 - Weight decay: 0.1
 - Gradient clipping: 1.0
 - Mixed precision training
-- Vocab size: 50280
-
+- Vocab size: 50280
 
 ## 🔬 Research Applications
 
```
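The first hunk names the corpora and the tokenizer the model suite relies on. As a quick way to inspect them, here is a minimal sketch using the Hugging Face `datasets` and `transformers` libraries; the `train` split and the `input_ids` column name are assumptions rather than details taken from this diff, so check the dataset cards on the Hub before relying on them.

```python
# Minimal sketch (not part of the README): stream a pretokenized corpus and
# decode a few tokens with the OLMo tokenizer referenced in the diff above.
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: a "train" split exists and each row carries an "input_ids" column.
stream = load_dataset("pico-lm/pretokenized-dolma-tiny", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")
print(len(tokenizer))  # expected to be near the vocab size of 50280 listed above

row = next(iter(stream))
print(tokenizer.decode(row["input_ids"][:32]))  # first few tokens of the first example
```

Streaming avoids downloading a full corpus, which matters for the 420B-token pretokenized-dolma; the tiny variants are the sensible targets for local experiments.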
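The configuration hunk lists weight decay 0.1, gradient clipping 1.0, and mixed precision training. The generic PyTorch sketch below shows how those three settings typically fit together in a single training step; it is not the project's actual training loop, and the optimizer choice (AdamW), learning rate, placeholder model, and CUDA device are assumptions.

```python
# Generic illustration of the listed settings; not the repository's code.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 50280).cuda()  # placeholder model; 50280 matches the listed vocab size
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # optimizer and lr are assumptions

def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision training
        logits = model(inputs)
        loss = F.cross_entropy(logits, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping: 1.0
    optimizer.step()
    return loss.item()
```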