---
datasets:
- EleutherAI/pile
language:
- en
pipeline_tag: text2text-generation
tags:
- t5x
- encoder-decoder
---

Pile-T5 Large is an encoder-decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, using a masked language modeling (MLM) objective similar to that of the original T5 model.

The Hugging Face (HF) version of Pile-T5 Large borrows UMT5's model implementation, since Pile-T5 uses the scalable model implementation from T5x. Its tokenizer is `LlamaTokenizer`.

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Blogpost](https://blog.eleuther.ai/pile-t5/). For details about the training dataset,
  see [the Pile paper](https://arxiv.org/abs/2101.00027) and [its
  datasheet](https://arxiv.org/abs/2201.07311).
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`.
  Please read the existing Pile-T5 documentation before asking about the model
  on Discord. For general correspondence:
  [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure style="width:30em">

| Hyperparameter             | Value       |
| -------------------------- | ----------- |
| n<sub>parameters</sub>     | 783173632   |
| n<sub>encoder layers</sub> | 24          |
| n<sub>decoder layers</sub> | 24          |
| d<sub>model</sub>          | 2816        |
| d<sub>emb</sub>            | 1024        |
| n<sub>heads</sub>          | 16          |
| d<sub>head</sub>           | 64          |
| n<sub>vocab</sub>          | 32128       |
| Sequence Length            | 512         |
</figure>

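
As a quick sanity check on the table, the short sketch below multiplies the number of attention heads by the per-head dimension. It only restates values listed above; the variable names are illustrative and are not taken from the model's configuration file.

```python
# Values copied from the hyperparameter table above.
n_heads = 16
d_head = 64
d_emb = 1024

# Total width of all attention heads concatenated together.
attention_width = n_heads * d_head

print(attention_width)           # 1024
print(attention_width == d_emb)  # True
```

The combined head width matches the d<sub>emb</sub> row of the table.
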
### Uses and limitations

#### Intended use

Pile-T5 was developed primarily for research purposes. It learns an inner
representation of the English language that can be used to extract features
useful for downstream tasks.

In addition to scientific uses, you may also further fine-tune and adapt
Pile-T5 for deployment, as long as your use is in accordance with the
Apache 2.0 license. This model works with the [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pile-T5 as a basis for your fine-tuned model, please note that
you need to conduct your own risk and bias assessment.

#### Out-of-scope use

Pile-T5 is **not** intended for deployment as-is. It is not a product
and cannot be used for human-facing interactions without supervision.

Pile-T5 has not been fine-tuned for downstream tasks for which language
models are commonly deployed, such as writing genre prose or powering
commercial chatbots. This means Pile-T5 will likely **not** respond to a given
prompt the way products such as ChatGPT do. This is because, unlike Pile-T5,
ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human
Feedback (RLHF) to better “understand” human instructions and dialogue.

This model is English-language only, and thus cannot be used for translation
or generating text in other languages.

#### Limitations and biases

The core functionality of Pile-T5 is to take a string of text that has been
partially replaced with mask tokens and predict a sequence of tokens that would
replace those mask tokens. Remember that the statistically most likely sequence
of tokens need not result in the most “accurate” text. Never rely on Pile-T5 to
produce factually accurate output.

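
To make the mask-and-predict format concrete, here is a minimal toy sketch of how a span-corrupted input and its target can be built from a list of words. It is not the T5x preprocessing code: the helper `span_corrupt` is hypothetical, the spans are chosen by hand rather than sampled, and the `<extra_id_N>` sentinel names follow the original T5 convention, which is assumed here rather than confirmed for Pile-T5's tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (inputs, targets).

    Sentinel names follow the original T5 convention (`<extra_id_N>`);
    whether Pile-T5's tokenizer uses the same names is an assumption.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the removed tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the remaining text
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in T5
    return inputs, targets


tokens = "The quick brown fox jumps over the lazy dog".split()
inputs, targets = span_corrupt(tokens, [(1, 3), (6, 7)])

print(" ".join(inputs))   # The <extra_id_0> fox jumps over <extra_id_1> lazy dog
print(" ".join(targets))  # <extra_id_0> quick brown <extra_id_1> the <extra_id_2>
```

The model sees the first string and is trained to emit the second. At inference time it returns whatever continuation is most likely for each sentinel, which is why the output need not be factually correct.
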
This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pile-T5 may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

We recommend curating the outputs of this model before presenting them to a
human reader. Please inform your audience that you are using artificially
generated text.

#### How to use

Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` class:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-large")
```

### Training

#### Training dataset

The Pile is an 825 GiB general-purpose dataset in English. It was created by
EleutherAI specifically for training large language models. It contains texts
from 22 diverse sources, roughly broken down into five categories: academic
writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project
Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub,
Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for
a breakdown of all data sources, methodology, and a discussion of ethical
implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for
more detailed documentation about the Pile and its component datasets. The
Pile can be downloaded from the [official website](https://pile.eleuther.ai/)
or from a [community mirror](https://the-eye.eu/public/AI/pile/).

The Pile was deduplicated before being used to train Pile-T5.

#### Training procedure

Pile-T5 was trained with a batch size of approximately 1M tokens
(2048 sequences of 512 tokens each), for a total of 2,000,000 steps. Pile-T5
was trained with the span-corruption objective.

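
The batch and step figures above can be multiplied out to recover the token counts quoted in this card. The sketch below does only that arithmetic. The constant names are illustrative, and the result counts token positions per batch, since how padding was handled is not stated here.

```python
# Figures taken from the training procedure described above.
SEQUENCES_PER_BATCH = 2048
TOKENS_PER_SEQUENCE = 512
TRAINING_STEPS = 2_000_000

# Token positions processed in one optimizer step.
tokens_per_step = SEQUENCES_PER_BATCH * TOKENS_PER_SEQUENCE

# Token positions processed over the whole run.
total_tokens = tokens_per_step * TRAINING_STEPS

print(f"{tokens_per_step:,}")  # 1,048,576
print(f"{total_tokens:,}")     # 2,097,152,000,000
```

The per-step figure is the "approximately 1M tokens" batch size, and the total of about 2.1 trillion is consistent with "roughly 2 trillion tokens".
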
#### Training checkpoints

Intermediate checkpoints for Pile-T5 are accessible within this repository.
There are 200 checkpoints in total, spaced 10,000 steps apart. For T5x-native
checkpoints that can be used for fine-tuning with the T5x library, see
[this repository](https://huggingface.co/lintang/pile-t5-large-t5x).

The training loss (in tfevent format) and validation perplexity (in jsonl) can be found [here](https://huggingface.co/EleutherAI/pile-t5-large/blob/main/large.zip).

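
As a small illustration of the checkpoint schedule, the sketch below enumerates the training steps at which checkpoints would exist. It assumes the first checkpoint was saved at step 10,000 and the last at step 2,000,000, which is consistent with 200 checkpoints at a 10,000-step spacing but is not stated explicitly above. The constant names are illustrative.

```python
CHECKPOINT_INTERVAL = 10_000
FINAL_STEP = 2_000_000

# Assumption: checkpoints run from step 10,000 through step 2,000,000 inclusive.
checkpoint_steps = list(
    range(CHECKPOINT_INTERVAL, FINAL_STEP + 1, CHECKPOINT_INTERVAL)
)

print(len(checkpoint_steps))   # 200
print(checkpoint_steps[:3])    # [10000, 20000, 30000]
print(checkpoint_steps[-1])    # 2000000
```

How each step maps to a branch or revision name in the repository is not specified here, so check the repository's branch list before passing a value to the `revision` argument of `from_pretrained`.
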
### Evaluations

Pile-T5 Large was evaluated on SuperGLUE and CodeXGLUE. A Flan-finetuned version was evaluated on Flan Held-In tasks.
Results can be seen in the [blogpost](https://blog.eleuther.ai/pile-t5/).

### BibTeX

```
@misc{2024PileT5,
  author  = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
  title   = {Pile-T5},
  year    = {2024},
  url     = {https://blog.eleuther.ai/pile-t5/},
  note    = {Blog post},
}
```