---
license: mit
language:
- en
---
# Nucleus-22B-token-500B

**Nucleus-22B-token-500B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the MIT license.**

*1T-token model coming soon.*
## Why use Nucleus-22B-token-500B?

* **It performs well compared to open-source models of similar size** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1), etc.), thanks to being trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
* **It is made available under an MIT license.**
* **It was trained by a small team of four people passionate about open source.**

⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
# Model Card for Nucleus-22B-token-500B

## Model Details

### Model Description

- **Developed by:** NucleusAI;
- **Model type:** Causal decoder-only;
- **Language(s) (NLP):** English;
- **License:** MIT.
### Model Source

- **Paper:** *coming soon*.

## Uses

### Direct Use

Research on large language models; as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbots, etc.).

### Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
## Bias, Risks, and Limitations

Nucleus-22B-token-500B is trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

### Recommendations

We recommend that users of Nucleus-22B-token-500B consider finetuning it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.

## How to Get Started with the Model
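A minimal sketch for running the model with the Hugging Face `transformers` library is shown below. The repository ID `NucleusAI/Nucleus-22B-token-500B` is assumed from the model name and may differ from the actual repository; in `bfloat16`, the 22B parameters need roughly 44 GB of accelerator memory.

```python
# Minimal text-generation sketch with Hugging Face Transformers.
# NOTE: the repository ID below is an assumption based on the model name.
import torch
import transformers
from transformers import AutoTokenizer

model_id = "NucleusAI/Nucleus-22B-token-500B"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 training precision
    device_map="auto",           # spread the 22B parameters across available GPUs
)

sequences = pipeline(
    "Open-source language models are useful because",
    max_new_tokens=128,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```

Since this is a raw pretrained model, prompts should be phrased as text to be continued rather than as instructions; see the finetuning recommendation above.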
## Training Details

### Training Data

Nucleus-22B-token-500B was trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.

| **Data source** | **Fraction** | **Tokens** | **Sources** |
|--------------------|--------------|------------|-----------------------------------|
| [RefinedWeb-English](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | 75% | 200B | massive web crawl |
| Books | 7% | 21B | |
| Code | 7% | 21B | Big Code, CodeNet |
| Technical | 6% | 19B | arXiv |
| Math | 5% | 17B | Mathematica, Khan Academy |

The data was tokenized with a tokenizer similar to that of [Llama-7B](https://huggingface.co/meta-llama/Llama-2-7b).
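
Because the tokenizer follows the Llama family, its vocabulary and segmentation behavior can be inspected directly; the snippet below is illustrative only and reuses the repository ID assumed in the getting-started example (for reference, the Llama-7B tokenizer has a 32,000-token vocabulary).

```python
# Illustrative sketch: inspect the Llama-style tokenizer.
# The repository ID is the same assumption used in the getting-started example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NucleusAI/Nucleus-22B-token-500B")

print(len(tokenizer))  # vocabulary size; ~32k for Llama-style tokenizers
print(tokenizer.tokenize("Nucleus-22B was trained on 500B tokens of web, book, code and math data."))
```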
### Training Procedure

Nucleus-22B-token-500B was trained on 256 A100 80GB GPUs, using FSDP (Fully Sharded Data Parallel).
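
Beyond the use of FSDP, the exact training configuration is not specified; purely as an illustration, the sketch below shows a minimal PyTorch FSDP wrap of a causal LM that shards parameters, gradients, and optimizer state across GPUs with `bfloat16` mixed precision. The sharding strategy and precision settings are assumptions for the sketch, not the team's published configuration.

```python
# Illustrative only: minimal PyTorch FSDP wrapping of a causal LM.
# Sharding strategy and precision below are assumptions, not the actual training config.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

def wrap_with_fsdp(model: torch.nn.Module) -> FSDP:
    """Shard parameters, gradients, and optimizer state across all ranks."""
    dist.init_process_group("nccl")  # one process per GPU, e.g. launched with torchrun
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3-style full sharding
        mixed_precision=bf16,
    )
```

`FULL_SHARD` spreads all model state across ranks, which is what allows a 22B-parameter model to fit within 80GB A100s without tensor parallelism.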
#### Training Hyperparameters

| **Hyperparameter** | **Value**  | **Comment**                              |
|--------------------|------------|------------------------------------------|
| Precision          | `bfloat16` |                                          |
| Optimizer          | AdamW      |                                          |
| Learning rate      | 2e-4       | 8B tokens warm-up, cosine decay to 1e-5  |
| Weight decay       | 1e-1       |                                          |
| Batch size         | 2048       | constant                                 |
| Context length     | 2048       | constant                                 |

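
One way to read the learning-rate row above is a linear warm-up over the first 8B tokens to the peak of 2e-4, followed by a cosine decay down to 1e-5. The sketch below implements that reading; the 500B-token decay horizon is an assumption, since the table only states the warm-up length and the two endpoint values.

```python
import math

def learning_rate(tokens_seen: float,
                  peak_lr: float = 2e-4,
                  min_lr: float = 1e-5,
                  warmup_tokens: float = 8e9,
                  total_tokens: float = 500e9) -> float:
    """Linear warm-up to peak_lr, then cosine decay to min_lr.

    The 500B-token decay horizon is an assumption for illustration.
    """
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    progress = min((tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: learning rate halfway through training.
print(learning_rate(250e9))  # ~1.1e-4
```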
#### Speeds, Sizes, Times

Training happened in early August 2023 and took about two weeks.