|
--- |
|
license: mit |
|
datasets: |
|
- sail/regmix-data |
|
- sail/regmix-data-sample |
|
language: |
|
- en |
|
tags: |
|
- regmix |
|
--- |
|
|
|
|
|
# Models Trained with Human Selection |
|
|
|
This is a collection of language models trained solely on Pile-CC, each with approximately 1B parameters and each trained with a different random seed. This project aims to validate the generalization capability of the RegMix approach (https://huggingface.co/papers/2407.01492) from small-scale (e.g., 1M-parameter) to large-scale (e.g., 1B-parameter) models.
|
|
|
## Key Features |
|
|
|
- **Model Size**: 5 separate models, each with ~1B parameters, trained with different random seeds

- **Training Data**: A Pile-CC-only data mixture drawn from the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset
|
|
|
## Dataset |
|
|
|
The models were trained using the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset, which splits The Pile into its constituent domains.
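
If you want to inspect the training data itself, you can pull it with the `datasets` library. This is a minimal sketch using the smaller [RegMix-Data-Sample](https://huggingface.co/datasets/sail/regmix-data-sample); the split name is an assumption, so check the dataset card for the actual layout:

```python
from datasets import load_dataset

# Assumption: a "train" split exists in the default configuration;
# consult the dataset card for the real configs/splits.
ds = load_dataset("sail/regmix-data-sample", split="train")
print(ds)
```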
|
|
|
## Training Hyperparameters |
|
|
|
| Hyperparameter | Value | |
|
|:---------------|:------| |
|
| Batch Size | 1M tokens | |
|
| Learning Rate | 4e-4 | |
|
| Minimum Learning Rate | 1e-5 | |
|
| Learning Rate Schedule | Cosine | |
|
| Warmup Ratio | 4% | |
|
| Total Tokens | 25B | |
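
The schedule above amounts to a linear warmup over the first 4% of steps followed by cosine decay from the peak learning rate (4e-4) to the minimum (1e-5). With a 1M-token batch and 25B total tokens, that is roughly 25,000 optimizer steps. A minimal sketch of the schedule (illustrative only, not the training code):

```python
import math

def lr_at_step(step: int, total_steps: int = 25_000,
               peak_lr: float = 4e-4, min_lr: float = 1e-5,
               warmup_ratio: float = 0.04) -> float:
    """Linear warmup followed by cosine decay, matching the table above."""
    warmup_steps = max(int(total_steps * warmup_ratio), 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup to the peak
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```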
|
|
|
## How to Load a Model |
|
|
|
You can load any model variant with the Hugging Face Transformers library by specifying the corresponding branch:
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer

# Each seed's checkpoint lives on its own branch, selected via `revision`.
model = AutoModelForCausalLM.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")

tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
|
``` |
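
Once loaded, the model can be used for standard causal generation. A small usage sketch (the prompt is arbitrary):

```python
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Greedy decoding of 32 new tokens; sampling flags are a matter of taste.
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```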
|
|
|
## Data Mixture |
|
|
|
The specific data mixture used to train this 1B model is shown below; it can also be found in [our code](https://github.com/sail-sg/regmix/blob/main/mixture_config/config_1b/human.yaml):
|
|
|
```yaml |
|
train: |
|
train_the_pile_pile_cc: 1.0 |
|
valid: |
|
valid_the_pile_pile_cc: 1.0 |
|
model_name: tinyllama_1_1b |
|
``` |
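
To work with the mixture configs programmatically, here is a minimal sketch using PyYAML (the path is the one in the repository linked above):

```python
import yaml  # pip install pyyaml

with open("mixture_config/config_1b/human.yaml") as f:
    cfg = yaml.safe_load(f)

# Domain weights form a distribution; here Pile-CC carries all the mass.
assert abs(sum(cfg["train"].values()) - 1.0) < 1e-6
print(cfg["train"])
```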
|
|
|
## Model Variants |
|
|
|
To access a different model variant, simply change the `revision` parameter in the `from_pretrained` method to the desired seed (e.g., "seed-2", "seed-3"); seeds 1 through 5 are available.
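
For example, to sweep all five variants (a minimal sketch; each checkpoint is a separate ~1B-parameter download):

```python
from transformers import AutoModelForCausalLM

for seed in range(1, 6):
    model = AutoModelForCausalLM.from_pretrained(
        "sail/data-mixture-pile-cc-1b", revision=f"seed-{seed}"
    )
    # ... evaluate or generate with this seed's checkpoint ...
```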
|
|
|
## Model Performance |
|
|
|
We evaluated each model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The performance metric for each task is the average of the 0-shot through 5-shot `acc_norm` (normalized accuracy, where available) or `acc` (accuracy) scores.
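
To reproduce one cell of the table below, here is a hedged sketch using the harness's Python API (the `simple_evaluate` entry point and the `"acc_norm,none"` result key are from recent harness versions and may differ in yours):

```python
import numpy as np
import lm_eval

scores = []
for k in range(6):  # average the 0-shot through 5-shot scores, as in the table
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=sail/data-mixture-pile-cc-1b,revision=seed-1",
        tasks=["piqa"],
        num_fewshot=k,
    )
    scores.append(out["results"]["piqa"]["acc_norm,none"])

print(f"PIQA 0-5 shot average: {100 * np.mean(scores):.2f}")
```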
|
|
|
| Seed | PIQA | LAMBADA | MultiRC | LogiQA | SocialIQA | Winogrande | RACE | OpenBookQA | COPA | HellaSwag | SciQ | ARC Easy | QQP | Average | |
|
|------|------|---------|---------|--------|-----------|------------|------|------------|------|-----------|------|----------|-----|---------| |
|
| 1 | 69.23 | 33.16 | 50.33 | 27.57 | 33.22 | 52.10 | 31.80 | 31.07 | 65.83 | 44.15 | 81.77 | 51.80 | 57.04 | 48.39 | |
|
| 2 | 68.62 | 33.69 | 53.15 | 25.13 | 32.96 | 51.24 | 31.06 | 30.84 | 69.80 | 43.28 | 83.18 | 52.00 | 58.06 | 48.69 | |
|
| 3 | 69.04 | 35.68 | 52.38 | 26.36 | 33.45 | 51.95 | 30.83 | 30.16 | 66.80 | 42.80 | 83.32 | 51.57 | 57.69 | 48.62 | |
|
| 4 | 69.35 | 33.56 | 50.01 | 26.24 | 33.62 | 50.99 | 31.81 | 30.44 | 65.60 | 43.00 | 83.00 | 52.33 | 56.14 | 48.16 | |
|
| 5 | 67.91 | 35.09 | 49.93 | 27.50 | 33.90 | 52.85 | 31.77 | 30.04 | 69.40 | 42.62 | 80.94 | 51.25 | 61.03 | 48.79 | |
|
|
|
|
|
## Usage Notes |
|
|
|
- These models are primarily intended for research purposes. |
|
- Performance may vary depending on the specific task and domain. |
|
|
|
## Citation |
|
|
|
If you use these models in your research, please cite the RegMix paper: |
|
|
|
``` |
|
@article{liu2024regmix, |
|
title={RegMix: Data Mixture as Regression for Language Model Pre-training}, |
|
author={Liu, Qian and Zheng, Xiaosen and Muennighoff, Niklas and Zeng, Guangtao and Dou, Longxu and Pang, Tianyu and Jiang, Jing and Lin, Min}, |
|
journal={arXiv preprint arXiv:2407.01492}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
For more information about the RegMix methodology and its applications, please refer to the [original paper](https://huggingface.co/papers/2407.01492). |