Update README.md
README.md (CHANGED)
```diff
@@ -40,20 +40,21 @@ Each model includes:
 
 ### Available Datasets
 1. **[pretokenized-dolma](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)**
-    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **DOLMA
+    - 420B tokens of pre-processed, tokenized and shuffled text extracted from the **[DOLMA](https://allenai.org/dolma)** corpus
     - We use this dataset to train our model suite
 
 2. **[pretokenized-dolma-tiny](https://huggingface.co/datasets/pico-lm/pretokenized-dolma-tiny)**
     - A smaller version of the **pretokenized-dolma** corpus for quick experiments
 
 3. **[pretokenized-paloma](https://huggingface.co/datasets/pico-lm/pretokenized-paloma)**
-    - A tokenized and shuffled version of the **Paloma
+    - A tokenized and shuffled version of the **[Paloma](https://allenai.org/evaluation-frameworks)** evaluation corpus
     - The Paloma corpus was carefully curated to be disjoint from the Dolma corpus and provides
     - We use this corpus to evaluate the perplexity of our models
 
 4. **[pretokenized-paloma-tinsy](https://huggingface.co/datasets/pico-lm/pretokenized-paloma-tinsy)**
     - A sub-sampled version of the **pretokenized-dolma** corpus
 
+All datasets are tokenized using the **[OLMo Tokenizer](https://huggingface.co/allenai/OLMo-7B-0724-hf/blob/main/tokenizer_config.json)**
 
 ## 🔧 GitHub Training Framework
 
@@ -116,8 +117,7 @@ Standard configuration (customizable in GitHub training):
 - Weight decay: 0.1
 - Gradient clipping: 1.0
 - Mixed precision training
-- Vocab size: 50280
-
+- Vocab size: 50280
 
 ## 🔬 Research Applications
 
```
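The first hunk names the corpora and the tokenizer the model suite relies on. As a quick way to inspect them, here is a minimal sketch using the Hugging Face `datasets` and `transformers` libraries; the `train` split and the `input_ids` column name are assumptions rather than details taken from this diff, so check the dataset cards on the Hub before relying on them.

```python
# Minimal sketch (not part of the README): stream a pretokenized corpus and
# decode a few tokens with the OLMo tokenizer referenced in the diff above.
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumptions: a "train" split exists and each row carries an "input_ids" column.
stream = load_dataset("pico-lm/pretokenized-dolma-tiny", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-0724-hf")
print(len(tokenizer))  # expected to be near the vocab size of 50280 listed above

row = next(iter(stream))
print(tokenizer.decode(row["input_ids"][:32]))  # first few tokens of the first example
```

Streaming avoids downloading a full corpus, which matters for the 420B-token pretokenized-dolma; the tiny variants are the sensible targets for local experiments.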
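The configuration hunk lists weight decay 0.1, gradient clipping 1.0, and mixed precision training. The generic PyTorch sketch below shows how those three settings typically fit together in a single training step; it is not the project's actual training loop, and the optimizer choice (AdamW), learning rate, placeholder model, and CUDA device are assumptions.

```python
# Generic illustration of the listed settings; not the repository's code.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 50280).cuda()  # placeholder model; 50280 matches the listed vocab size
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # optimizer and lr are assumptions

def training_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # mixed precision training
        logits = model(inputs)
        loss = F.cross_entropy(logits, targets)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping: 1.0
    optimizer.step()
    return loss.item()
```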