Update README.md
README.md CHANGED
@@ -78,9 +78,14 @@ widget:
 
 # Model Summary
 
-
-
-
+> With a new decentralized training algorithm, we fine-tuned GPT-J (6B) on 3.53 billion tokens, resulting in GPT-JT (6B), a model that outperforms many 100B+ parameter models on classification benchmarks.
+
+We incorporated a collection of open techniques and datasets to build GPT-JT:
+- GPT-JT was trained based on GPT-J (6B), created by [EleutherAI](https://www.eleuther.ai);
+- We used [UL2](https://github.com/google-research/google-research/tree/master/ul2)'s training objective, which allows the model to use bidirectional context to process the prompt;
+- The model was trained on a large collection of diverse data, including [Chain-of-Thought (CoT)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html), the [Public Pool of Prompts (P3)](https://huggingface.co/datasets/bigscience/P3) dataset, and the [Natural-Instructions (NI)](https://github.com/allenai/natural-instructions) dataset.
+
+With the help of the techniques mentioned above, GPT-JT significantly improves performance on classification tasks over the original GPT-J, and even outperforms most 100B+ parameter models!
 
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
 ## UL2 Training Objective
 
 We train GPT-J using the UL2 training objective [1][2].
-The usual GPT model, including GPT-J, uses the lower left
-In order to fully leverage the context information, we continue training with UL2 training objectives, and uses
+The usual GPT model, including GPT-J, uses a causal mask (as shown in the lower left) for autoregressive generation, so each token can only see the context before itself.
+In order to fully leverage the context information, we continue training GPT-J with the UL2 training objective and use a causal mask with prefix (as shown in the lower right): bidirectional attention over the prompt / input and causal attention for token generation.
+Intuitively, being able to see the context bidirectionally might improve downstream tasks that require this information.
 
 $$
 \begin{bmatrix}
@@ -126,15 +132,13 @@ $$
 \end{bmatrix}
 $$
 
-
-
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, the pile data.
+Furthermore, we leverage a large collection of data, including NI, P3, COT, and the Pile:
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
 
-
+Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
 
 ## Hyperparameters
 
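The causal mask and the causal mask with prefix referenced in the hunk above can be illustrated with a minimal numpy sketch. This is illustrative only; the function names and the choice of numpy are assumptions for exposition, not the GPT-JT training code.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Standard causal mask (the "lower left" matrix): token i attends only to tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # Causal mask with prefix (the "lower right" matrix): the first `prefix_len`
    # tokens (the prompt / input) attend to each other bidirectionally, while the
    # remaining tokens keep the causal pattern for generation.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: a sequence of 6 tokens whose first 3 tokens form the prompt.
print(prefix_causal_mask(seq_len=6, prefix_len=3).astype(int))
```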
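The second-stage mixture added above (0.92 billion tokens at 5% COT / 20% P3 / 20% NI / 55% Pile) can be sanity-checked with a short sketch. The per-source token estimates and the batch-source sampler below are illustrative assumptions, not the actual data pipeline.

```python
import random

# Second-stage mixture reported above: 0.92 billion tokens in total.
MIXTURE = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}
TOTAL_TOKENS = 0.92e9

# Approximate token budget per source under these weights.
for name, weight in MIXTURE.items():
    print(f"{name}: ~{weight * TOTAL_TOKENS / 1e9:.3f}B tokens")

# One simple way to realize such a mixture is to pick the source of each
# training batch by weighted sampling (hypothetical; not the GPT-JT code).
def sample_source(rng: random.Random) -> str:
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(10)])
```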
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
 ## Infrastructure
 
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
+The model was trained on computers networked with a 1 Gbps interconnect (in contrast, data center networks run at 100 Gbps to 1.6 Tbps).
 
 # References
 