stefanhex-apollo committed
Update README.md
README.md
CHANGED
@@ -10,7 +10,7 @@ This is a gpt2-small model with LayerNorm fine-tuned out.
 The model was fine-tuned on OpenWebText for ~500M tokens (1000 iterations of batch size ~488 at 1024 context length) while gradually disabling LayerNorm layers.
 For details see [here](https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour) and the upcoming paper.
 
-There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the paper
+There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the [paper](https://publications.apolloresearch.ai/remove_layer_norm.pdf)
 for details. The best model (v4) is the default as of 6th September 2024 (previously v2 was the default).
 
 The model is a `GPT2LMHeadModel` (to avoid requiring `trust_remote_code`) which technically contains LayerNorm blocks.
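
Since the checkpoint is stored as a standard `GPT2LMHeadModel`, it loads with stock `transformers` and no `trust_remote_code` flag. Below is a minimal loading sketch; the repo id `apollo-research/gpt2_noLN` is an assumption, so substitute the actual repository name for this model card.

```python
# Minimal sketch of loading the LayerNorm-free GPT-2 checkpoint.
# NOTE: the repo id "apollo-research/gpt2_noLN" is an assumption;
# replace it with this model card's actual repository name.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# The checkpoint is a plain GPT2LMHeadModel, so no trust_remote_code is needed.
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # standard GPT-2 tokenizer

# The architecture still exposes the usual LayerNorm modules
# (ln_1, ln_2, ln_f), even though the fine-tuning described above
# is meant to leave them with no meaningful normalization to do.
print(model.transformer.h[0].ln_1)

inputs = tokenizer("After removing LayerNorm,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```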