Update README.md
README.md CHANGED
@@ -78,9 +78,14 @@ widget:
 
 # Model Summary
 
-
-
-
+> With a new decentralized training algorithm, we fine-tuned GPT-J (6B) on 3.53 billion tokens, resulting in GPT-JT (6B), a model that outperforms many 100B+ parameter models on classification benchmarks.
+
+We incorporated a collection of open techniques and datasets to build GPT-JT:
+- GPT-JT was trained based on GPT-J (6B), created by [EleutherAI](https://www.eleuther.ai);
+- We used [UL2](https://github.com/google-research/google-research/tree/master/ul2)'s training objective, which allows the model to use bidirectional context to process the prompt;
+- The model was trained on a large collection of diverse data, including [Chain-of-Thought (CoT)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html), the [Public Pool of Prompts (P3)](https://huggingface.co/datasets/bigscience/P3) dataset, and the [Natural-Instructions (NI)](https://github.com/allenai/natural-instructions) dataset.
+
+With the help of the techniques mentioned above, GPT-JT significantly improves performance on classification tasks over the original GPT-J, and even outperforms most 100B+ parameter models!
 
 ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
 ## UL2 Training Objective
 
 We train GPT-J using the UL2 training objective [1][2].
-The usual GPT model, including GPT-J, uses the lower left
-In order to fully leverage the context information, we continue training with UL2 training objectives, and uses
+The usual GPT model, including GPT-J, uses a causal mask (as shown in the lower left) for autoregressive generation, so each token can only see the context before itself.
+In order to fully leverage the context information, we continue training GPT-J with the UL2 training objective and use a causal mask with prefix (as shown in the lower right): bidirectional attention over the prompt / input and causal attention for token generation.
+Intuitively, being able to see the context bidirectionally might improve downstream tasks that require this information.
 
 $$
 \begin{bmatrix}
@@ -126,15 +132,13 @@ $$
 \end{bmatrix}
 $$
 
-
-
-We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, the pile data.
+Furthermore, we leverage a large collection of data, including NI, P3, COT, and the Pile:
 - [Natural-Instructions](https://github.com/allenai/natural-instructions)
 - [P3](https://huggingface.co/datasets/Muennighoff/P3)
 - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
 - [the pile](https://huggingface.co/datasets/the_pile)
 
-
+Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
 
 ## Hyperparameters
 
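The causal mask and the causal mask with prefix referenced in the hunk above can be illustrated with a minimal numpy sketch. This is illustrative only; the function names and the choice of numpy are assumptions for exposition, not the GPT-JT training code.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Standard causal mask (the "lower left" matrix): token i attends only to tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_causal_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # Causal mask with prefix (the "lower right" matrix): the first `prefix_len`
    # tokens (the prompt / input) attend to each other bidirectionally, while the
    # remaining tokens keep the causal pattern for generation.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: a sequence of 6 tokens whose first 3 tokens form the prompt.
print(prefix_causal_mask(seq_len=6, prefix_len=3).astype(int))
```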
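The second-stage mixture added above (0.92 billion tokens at 5% COT / 20% P3 / 20% NI / 55% Pile) can be sanity-checked with a short sketch. The per-source token estimates and the batch-source sampler below are illustrative assumptions, not the actual data pipeline.

```python
import random

# Second-stage mixture reported above: 0.92 billion tokens in total.
MIXTURE = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}
TOTAL_TOKENS = 0.92e9

# Approximate token budget per source under these weights.
for name, weight in MIXTURE.items():
    print(f"{name}: ~{weight * TOTAL_TOKENS / 1e9:.3f}B tokens")

# One simple way to realize such a mixture is to pick the source of each
# training batch by weighted sampling (hypothetical; not the GPT-JT code).
def sample_source(rng: random.Random) -> str:
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(10)])
```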
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
 ## Infrastructure
 
 We used [the Together Research Computer](https://together.xyz/) to conduct training.
+The model was trained on computers networked with a 1 Gbps interconnect (in contrast, data center networks run at 100 Gbps to 1.6 Tbps).
 
 # References
 