stefanhex-apollo committed
Commit 64a829f
Parent(s): 5d176c5
Update README.md

README.md CHANGED
@@ -8,9 +8,8 @@ tags: []
 This is a gpt2-small model with LayerNorm fine-tuned out.
 
 The model was fine-tuned on OpenWebText for ~500M tokens (1000 iterations of batch size ~488 at 1024 context length) while gradually disabling LayerNorm layers.
-For details see [here](https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour) and the upcoming paper.
 
-There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the [paper](https://arxiv.org/abs/2409.13710)
+There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the [paper](https://arxiv.org/abs/2409.13710) or [blog post](https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour)
 for details; the training code is available [here](https://github.com/ApolloResearch/gpt2_noLN). The best model (v4) is the default as of 6th September 2024 (previously v2 was the default).
 
 The model is a `GPT2LMHeadModel` (to avoid requiring `trust_remote_code`) which technically contains LayerNorm blocks.
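Because the checkpoint is exposed as a plain `GPT2LMHeadModel`, it loads with the standard `transformers` API and needs no `trust_remote_code`. Below is a minimal loading sketch; the repo id shown is a placeholder assumption and should be replaced with this model's actual Hugging Face id.

```python
# Minimal loading sketch, assuming the Hugging Face `transformers` library.
# NOTE: the repo id below is a placeholder (assumption); substitute the actual model id.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_id = "apollo-research/gpt2_noLN"  # placeholder repo id
model = GPT2LMHeadModel.from_pretrained(model_id)  # plain GPT-2 class, no trust_remote_code needed
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # standard GPT-2 tokenizer

inputs = tokenizer("The LayerNorm-free GPT-2 model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```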