Update README.md
README.md CHANGED
@@ -109,8 +109,8 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

## UL2 Training Objective

-We train GPT-
-The
+We train GPT-JT using the UL2 training objective [1][2].
+The original GPT-J uses a causal mask (as shown in the lower left) to perform autoregressive generation, so each token can only see its previous context.
In order to fully leverage the context information, we continue training GPT-J with the UL2 training objective, and use a causal mask with prefix (as shown in the lower right) -- using bidirectional attention for the prompt / input and causal attention for token generation.
Intuitively, being able to see the context bidirectionally might improve downstream tasks that require this information.
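
Below is a minimal sketch, in plain PyTorch, of what such a causal mask with prefix looks like. It only illustrates the attention pattern described above; the helper name and the example lengths are made up and are not taken from the GPT-JT training code.

```python
# Minimal sketch of a "causal mask with prefix" (illustrative only; not the
# actual GPT-JT training code).
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask; entry (i, j) is True when position i
    may attend to position j."""
    # GPT-J's usual causal mask: lower-triangular, so each token sees only
    # itself and earlier tokens.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Prefix part (the prompt / input): let these positions attend to each
    # other bidirectionally; positions after the prefix stay causal.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 6 tokens, of which the first 3 are the prompt / input.
print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
```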
@@ -136,7 +136,7 @@ Furthermore, we leverage a large collection of data, including NI, P3, COT, the

- [Natural-Instructions](https://github.com/allenai/natural-instructions)
- [P3](https://huggingface.co/datasets/Muennighoff/P3)
- [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
-- [the
+- [the Pile](https://huggingface.co/datasets/the_pile)

Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
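
As a rough illustration of this second-phase mixture (a sketch only; the per-example sampling scheme below is an assumption, not the actual GPT-JT data pipeline), the stated proportions translate into per-source token budgets and a sampler along these lines:

```python
# Illustrative sketch of the second-phase data mixture described above.
# The 0.92B-token budget and the weights come from the text; the sampling
# scheme itself is an assumption, not the actual GPT-JT pipeline.
import random

PHASE2_TOKENS = int(0.92e9)  # tokens trained after the UL2-on-Pile phase
MIXTURE = {"COT": 0.05, "P3": 0.20, "NI": 0.20, "Pile": 0.55}

# Rough per-source token budget implied by the mixture weights.
for name, frac in MIXTURE.items():
    print(f"{name:>4}: ~{frac * PHASE2_TOKENS / 1e6:.0f}M tokens")

rng = random.Random(0)

def sample_source() -> str:
    """Pick which dataset the next training example is drawn from."""
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

print([sample_source() for _ in range(8)])
```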