---
datasets:
- EleutherAI/pile
language:
- en
pipeline_tag: text2text-generation
tags:
- t5x
- encoder-decoder
---

Pile-T5 Large is an encoder-decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, using an MLM objective similar to that of the original T5 model.
The HF version of Pile-T5 Large borrows UMT5's model implementation, because Pile-T5 uses the scalable model implementation from T5x, and it uses `LlamaTokenizer` as its tokenizer.

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Blogpost](https://blog.eleuther.ai/pile-t5/). For details about the training dataset, 
see [the Pile paper](https://arxiv.org/abs/2101.00027) and
[its datasheet](https://arxiv.org/abs/2201.07311).
- License: Apache 2.0
- Contact: to ask questions about this model, join the
[EleutherAI Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`. 
Please read the existing GPT-NeoX-20B documentation before asking about the model 
on Discord. For general correspondence:
[contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure style="width:30em">

| Hyperparameter             | Value       |
| -------------------------- | ----------- |
| n<sub>parameters</sub>     | 783173632   |
| n<sub>encoder layers</sub> | 24          |
| n<sub>decoder layers</sub> | 24          |
| d<sub>model</sub>          | 2816        |
| d<sub>emb</sub>            | 1024        |
| n<sub>heads</sub>          | 16          |
| d<sub>head</sub>           | 64          |
| n<sub>vocab</sub>          | 32128       |
| Sequence Length            | 512         |
</figure>

### Uses and limitations

#### Intended use

Pile-T5 was developed primarily for research purposes. It learns an inner 
representation of the English language that can be used to extract features 
useful for downstream tasks.

In addition to scientific uses, you may also further fine-tune and adapt 
Pile-T5 for deployment, as long as your use is in accordance with the 
Apache 2.0 license. This model works with the [Transformers 
Library](https://huggingface.co/docs/transformers/index). If you decide to use 
pre-trained Pile-T5 as a basis for your fine-tuned model, please note that 
you need to conduct your own risk and bias assessment. 

#### Out-of-scope use

Pile-T5 is **not** intended for deployment as-is. It is not a product 
and cannot be used for human-facing interactions without supervision.

Pile-T5 has not been fine-tuned for downstream tasks for which language 
models are commonly deployed, such as writing genre prose, or commercial 
chatbots. This means Pile-T5 will likely **not** respond to a given prompt 
the way products such as ChatGPT do. This is because, unlike Pile-T5, 
ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human 
Feedback (RLHF) to better “understand” human instructions and dialogue.

This model is English-language only, and thus cannot be used for translation
or generating text in other languages.

#### Limitations and biases

The core functionality of Pile-T5 is to take a string of text that has been 
partially replaced with mask tokens and predict a sequence of tokens that would
replace those mask tokens. Remember that the statistically most likely sequence
of tokens need not result in the most “accurate” text. Never rely on Pile-T5 to produce 
factually accurate output.
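
To make the masked-span format concrete, here is a minimal sketch of how an input and its target could be laid out. It assumes the sentinel naming used by the original T5 (`<extra_id_0>`, `<extra_id_1>`, ...), which this card does not confirm for Pile-T5, and it works on whitespace-separated words rather than real tokenizer output. The helper `span_corrupt` is hypothetical and written only for illustration.

```python
def span_corrupt(tokens, spans, sentinel="<extra_id_{}>"):
    """Replace each (start, end) span with a sentinel and build the target.

    tokens: list of strings (here: words, not real subword tokens).
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.
    sentinel: format string for mask tokens (assumed T5-style naming).
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        # Keep the unmasked text before the span, then drop in a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel.format(i))
        # The target repeats the sentinel followed by the hidden text.
        targets.append(sentinel.format(i))
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target, as in the original T5 setup.
    targets.append(sentinel.format(len(spans)))
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(words, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model sees the first string and is trained to produce the second; at inference time it returns whatever continuation it finds most likely, which is why the output should not be treated as factual.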

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset 
known to contain profanity and texts that are lewd or otherwise offensive. 
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a 
discussion of documented biases with regards to gender, religion, and race. 
Pile-T5 may produce socially unacceptable or undesirable text, *even if*
 the prompt itself does not include anything explicitly offensive. 

We recommend curating the outputs of this model before presenting them to a human 
reader. Please inform your audience that you are using artificially generated 
text. 

#### How to use

Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` class:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-large")
```
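
Once the model and tokenizer are loaded, filling masked spans could look roughly like the sketch below. The helper name `fill_masked_spans` is hypothetical, and the sketch assumes the tokenizer returns `input_ids` and `attention_mask` and that the prompt already contains the model's mask tokens. The exact sentinel strings are not stated in this card, so inspect `tokenizer.additional_special_tokens` before relying on any particular spelling.

```python
def fill_masked_spans(model, tokenizer, text, max_new_tokens=20):
    """Generate a replacement sequence for the mask tokens in `text`.

    `model` and `tokenizer` are the objects returned by
    AutoModelForSeq2SeqLM.from_pretrained and AutoTokenizer.from_pretrained.
    """
    # Encode the masked prompt as PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt")
    # Greedy decoding with the library's default generation settings.
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    # Keep special tokens so the sentinels delimiting each span stay visible.
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)
```

With the objects from the snippet above, a call would be `fill_masked_spans(model, tokenizer, prompt)`, where `prompt` is a string containing one or more mask tokens. Because the model is not fine-tuned, the returned text is a raw span prediction rather than an answer to an instruction.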

### Training

#### Training dataset

The Pile is an 825 GiB general-purpose dataset in English. It was created by 
EleutherAI specifically for training large language models. It contains texts 
from 22 diverse sources, roughly broken down into five categories: academic 
writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project 
Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub, 
Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for 
a breakdown of all data sources, methodology, and a discussion of ethical 
implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for 
more detailed documentation about the Pile and its component datasets. The 
Pile can be downloaded from the [official website](https://pile.eleuther.ai/), 
or from a [community mirror](https://the-eye.eu/public/AI/pile/).

The Pile was deduplicated before being used to train Pile-T5.

#### Training procedure

Pile-T5 was trained with a batch size of approximately 1M tokens 
(2048 sequences of 512 tokens each), for a total of 2,000,000 steps. Pile-T5 was trained
with the span-corruption objective.
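
The figures above can be checked with a short calculation. This is only back-of-the-envelope arithmetic: it counts 512 input positions per sequence and ignores padding and decoder-side target tokens, which is an assumption about how the "roughly 2 trillion tokens" figure was derived.

```python
SEQUENCES_PER_BATCH = 2048
TOKENS_PER_SEQUENCE = 512
TRAINING_STEPS = 2_000_000

# Tokens processed in one optimizer step (the "approximately 1M" batch size).
tokens_per_step = SEQUENCES_PER_BATCH * TOKENS_PER_SEQUENCE

# Tokens seen over the full run (the "roughly 2 trillion" figure).
total_tokens = tokens_per_step * TRAINING_STEPS

print(f"{tokens_per_step:,} tokens per step")  # 1,048,576 tokens per step
print(f"{total_tokens:,} tokens in total")     # 2,097,152,000,000 tokens in total
```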

#### Training checkpoints

Intermediate checkpoints for Pile-T5 are accessible within this repository. 
There are 200 checkpoints in total, spaced 10,000 steps apart. For T5x-native
checkpoints that can be used for fine-tuning with the T5x library, see [this repository](https://huggingface.co/lintang/pile-t5-large-t5x).
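
As a rough sketch of the checkpoint schedule, the steps can be enumerated as follows. It assumes the first checkpoint falls at step 10,000 and the last at step 2,000,000, which is consistent with 200 checkpoints at a 10,000-step spacing but is not stated explicitly here. How checkpoints are named in the repository is also not specified, so the hypothetical helper `nearest_checkpoint` returns only step numbers.

```python
CHECKPOINT_INTERVAL = 10_000
TOTAL_STEPS = 2_000_000

# Assumed schedule: one checkpoint every 10,000 steps, ending at the final step.
checkpoint_steps = list(
    range(CHECKPOINT_INTERVAL, TOTAL_STEPS + 1, CHECKPOINT_INTERVAL)
)


def nearest_checkpoint(step):
    """Return the saved step closest to an arbitrary training step."""
    return min(checkpoint_steps, key=lambda s: abs(s - step))


print(len(checkpoint_steps))        # 200
print(nearest_checkpoint(123_456))  # 120000
```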

The training loss (in tfevent format) and validation perplexity (in jsonl) can be found [here](https://huggingface.co/EleutherAI/pile-t5-large/blob/main/large.zip).

### Evaluations

Pile-T5 Large was evaluated on SuperGLUE and CodeXGLUE. A Flan-finetuned version was evaluated on the Flan held-in tasks.
Results are reported in the [blogpost](https://blog.eleuther.ai/pile-t5/).

### BibTeX

```
@misc{2024PileT5,
  author  = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
  title   = {Pile-T5},
  year    = {2024},
  url     = {https://blog.eleuther.ai/pile-t5/},
  note    = {Blog post},
}
```