	---
	license: mit
	datasets:
	- sail/regmix-data
	- sail/regmix-data-sample
	language:
	- en
	tags:
	- regmix
	---


	# Models Trained with Human Selection

	This is a collection of language models trained only on Pile-CC, each with approximately 1B parameters and a different random seed. The project aims to validate that the RegMix approach (https://huggingface.co/papers/2407.01492) generalizes from small-scale (e.g., 1M parameters) to large-scale (e.g., 1B parameters) models.

	## Key Features

	- Model Size: 5 separate models trained with different seeds, each with ~1B parameters
	- Training Data: A Pile-CC-only data mixture drawn from the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset

	## Dataset

	The models were trained on the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset, which organizes data from The Pile into separate domains.

	## Training Hyperparameters

	| Hyperparameter | Value |
	|:---------------|:------|
	| Batch Size | 1M tokens |
	| Learning Rate | 4e-4 |
	| Minimum Learning Rate | 1e-5 |
	| Learning Rate Schedule | Cosine |
	| Warmup Ratio | 4% |
	| Total Tokens | 25B |
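
As a rough illustration of how these values fit together, the sketch below derives a step count and a learning-rate curve from the table. It is not the training code. It assumes that "1M tokens" means exactly 1,000,000 tokens per step, that warmup is linear, and that the cosine phase decays from the peak to the minimum learning rate over the remaining steps. The name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 4e-4
MIN_LR = 1e-5
WARMUP_RATIO = 0.04
TOTAL_TOKENS = 25_000_000_000
TOKENS_PER_STEP = 1_000_000  # assumption: "1M tokens" read as exactly 10^6

TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP      # 25,000 optimizer steps
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)   # 1,000 warmup steps


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed step under the assumptions above."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from near zero up to the peak rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run takes 25,000 steps, of which the first 1,000 are warmup.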

	## How to Load a Model

	You can load any model using the corresponding branch with the Hugging Face Transformers library:

	```python
	from transformers import AutoModel, AutoTokenizer

	# Each seed is stored on its own branch; select it with `revision`.
	model = AutoModel.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
	tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
	```

	## Data Mixture

	The data mixture used to train these 1B models is shown below and is also available in [our code](https://github.com/sail-sg/regmix/blob/main/mixture_config/config_1b/human.yaml):

	```yaml
	train:
	  train_the_pile_pile_cc: 1.0
	valid:
	  valid_the_pile_pile_cc: 1.0
	model_name: tinyllama_1_1b
	```
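
The sketch below shows one way to sanity-check a mixture like this one before training. It rests on two assumptions that the card does not state: the domain keys are nested under `train` and `valid`, and each split's weights are sampling proportions that should be non-negative and sum to 1. The config is mirrored as a plain dict so that no YAML parser is needed, and `check_split` is a hypothetical helper.

```python
import math

# Mirror of the config above as a plain dict (assumed nesting).
mixture = {
    "train": {"train_the_pile_pile_cc": 1.0},
    "valid": {"valid_the_pile_pile_cc": 1.0},
    "model_name": "tinyllama_1_1b",
}


def check_split(weights: dict) -> float:
    """Validate one split's domain weights and return their sum."""
    negative = [name for name, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative weights for: {negative}")
    total = sum(weights.values())
    # Assumption: weights are proportions, so they should sum to 1.
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return total


if __name__ == "__main__":
    for split in ("train", "valid"):
        print(split, check_split(mixture[split]))
```

Here all of the weight sits on the Pile-CC domain, so each split passes the check trivially.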

	## Model Variants

	To access a different model variant, set the `revision` parameter of `from_pretrained` to the desired seed branch (e.g., "seed-2", "seed-3"). The maximum seed is 5.
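
A small helper can make the seed-to-branch mapping explicit. This is a minimal sketch, assuming the branches are named `seed-1` through `seed-5` (the card shows `seed-1` and gives 5 as the maximum). The names `revision_for_seed` and `load_variant` are hypothetical, and `load_variant` downloads the weights when called.

```python
REPO_ID = "sail/data-mixture-pile-cc-1b"
NUM_SEEDS = 5  # the card gives 5 as the maximum seed


def revision_for_seed(seed: int) -> str:
    """Return the branch name for a seed, e.g. 3 -> "seed-3"."""
    if not 1 <= seed <= NUM_SEEDS:
        raise ValueError(f"seed must be between 1 and {NUM_SEEDS}, got {seed}")
    return f"seed-{seed}"


def load_variant(seed: int):
    """Load the model and tokenizer for one seed (downloads the weights)."""
    # Imported lazily so revision_for_seed works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    revision = revision_for_seed(seed)
    model = AutoModel.from_pretrained(REPO_ID, revision=revision)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    return model, tokenizer


if __name__ == "__main__":
    # List the available branches without downloading anything.
    print([revision_for_seed(s) for s in range(1, NUM_SEEDS + 1)])
```

Calling `load_variant(2)` would then load the `seed-2` checkpoint in the same way as the loading snippet earlier in this card.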

	## Model Performance

	We evaluated each model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). The reported score for each task is the average over the 0-shot through 5-shot settings of `acc_norm` (normalized accuracy) where available, and of `acc` (accuracy) otherwise.

	| Seed | PIQA | LAMBADA | MultiRC | LogiQA | SocialIQA | Winogrande | RACE | OpenBookQA | COPA | HellaSwag | SciQ | ARC Easy | QQP | Average |
	|------|------|---------|---------|--------|-----------|------------|------|------------|------|-----------|------|----------|-----|---------|
	| 1 | 69.23 | 33.16 | 50.33 | 27.57 | 33.22 | 52.10 | 31.80 | 31.07 | 65.83 | 44.15 | 81.77 | 51.80 | 57.04 | 48.39 |
	| 2 | 68.62 | 33.69 | 53.15 | 25.13 | 32.96 | 51.24 | 31.06 | 30.84 | 69.80 | 43.28 | 83.18 | 52.00 | 58.06 | 48.69 |
	| 3 | 69.04 | 35.68 | 52.38 | 26.36 | 33.45 | 51.95 | 30.83 | 30.16 | 66.80 | 42.80 | 83.32 | 51.57 | 57.69 | 48.62 |
	| 4 | 69.35 | 33.56 | 50.01 | 26.24 | 33.62 | 50.99 | 31.81 | 30.44 | 65.60 | 43.00 | 83.00 | 52.33 | 56.14 | 48.16 |
	| 5 | 67.91 | 35.09 | 49.93 | 27.50 | 33.90 | 52.85 | 31.77 | 30.04 | 69.40 | 42.62 | 80.94 | 51.25 | 61.03 | 48.79 |
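
The Average column can be reproduced from the table itself. The sketch below assumes it is the unweighted mean of the 13 task scores in each row, which matches the reported values to two decimals. The scores are copied from the table, and the variable names are illustrative.

```python
# Per-task scores for each seed, in the table's column order (PIQA ... QQP).
scores = {
    1: [69.23, 33.16, 50.33, 27.57, 33.22, 52.10, 31.80, 31.07, 65.83, 44.15, 81.77, 51.80, 57.04],
    2: [68.62, 33.69, 53.15, 25.13, 32.96, 51.24, 31.06, 30.84, 69.80, 43.28, 83.18, 52.00, 58.06],
    3: [69.04, 35.68, 52.38, 26.36, 33.45, 51.95, 30.83, 30.16, 66.80, 42.80, 83.32, 51.57, 57.69],
    4: [69.35, 33.56, 50.01, 26.24, 33.62, 50.99, 31.81, 30.44, 65.60, 43.00, 83.00, 52.33, 56.14],
    5: [67.91, 35.09, 49.93, 27.50, 33.90, 52.85, 31.77, 30.04, 69.40, 42.62, 80.94, 51.25, 61.03],
}

# Averages as reported in the table's last column.
reported = {1: 48.39, 2: 48.69, 3: 48.62, 4: 48.16, 5: 48.79}

# Unweighted mean over the 13 tasks for each seed.
averages = {seed: sum(vals) / len(vals) for seed, vals in scores.items()}

if __name__ == "__main__":
    for seed, avg in averages.items():
        print(f"seed {seed}: computed {avg:.2f}, reported {reported[seed]:.2f}")
```

The same dictionary can be reused to compare seeds on a single task, for example to see how much COPA or QQP varies across runs.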


	## Usage Notes

	- These models are primarily intended for research purposes.
	- Performance may vary depending on the specific task and domain.

	## Citation

	If you use these models in your research, please cite the RegMix paper:

	```
	@article{liu2024regmix,
	title={RegMix: Data Mixture as Regression for Language Model Pre-training},
	author={Liu, Qian and Zheng, Xiaosen and Muennighoff, Niklas and Zeng, Guangtao and Dou, Longxu and Pang, Tianyu and Jiang, Jing and Lin, Min},
	journal={arXiv preprint arXiv:2407.01492},
	year={2024}
	}
	```

	For more information about the RegMix methodology and its applications, please refer to the [original paper](https://huggingface.co/papers/2407.01492).