apollo-research/gpt2_noLN

stefanhex-apollo committed on Sep 6, 2024

Commit 8fa9a4f · verified · 1 parent: 621ed53

Update README.md

Files changed (1)

README.md +2 -3

@@ -10,9 +10,8 @@ This is a gpt2-small model with LayerNorm fine-tuned out.
 The model was fine-tuned on OpenWebText for ~500M tokens (1000 iterations of batch size ~488 at 1024 context length) while gradually disabling LayerNorm layers.
 For details see [here](https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour) and the upcoming paper.
-Available versions:
-* v2 (default): Trained for 1000 iterations in a single training run
-* v1: Trained for 900 iterations, with multiple interrupt, modify LNs, and resume steps
+There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the paper
+for details. The best model (v4) is the default as of 6th September 2024 (previously v2 was the default).
 The model is a `GPT2LMHeadModel` (to avoid requiring `trust_remote_code`) which technically contains LayerNorm blocks.
 However, the epsilon values are all set to 1e12 so that the LayerNorm has no effect. The LN scale is set to 1e6 (to counter the 1e12 epsilon), and the bias to 0.
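
The epsilon and scale trick described in the README can be checked numerically. The block below is a minimal sketch, not code from the repository: `layer_norm` is a hypothetical pure-Python helper that assumes the textbook LayerNorm formula, `weight * (x - mean) / sqrt(var + eps) + bias`, and the input values are arbitrary. With `eps = 1e12` the denominator is about `1e6` whatever the variance is, so a scale of `1e6` cancels it.

```python
import math

def layer_norm(x, weight, bias, eps):
    """Textbook LayerNorm over a 1-D list (hypothetical helper, for illustration only)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    denom = math.sqrt(var + eps)  # dominated by eps when eps is huge
    return [weight * (v - mean) / denom + bias for v in x]

x = [1.0, -2.0, 0.5, 3.5]  # arbitrary example activations

# Settings quoted in the README: epsilon 1e12, scale 1e6, bias 0.
out = layer_norm(x, weight=1e6, bias=0.0, eps=1e12)

# Compare against the input with only its mean subtracted.
mean = sum(x) / len(x)
centered = [v - mean for v in x]
max_diff = max(abs(o - c) for o, c in zip(out, centered))
print(max_diff)  # tiny: the division by the standard deviation is neutralised
```

In this sketch the output matches the mean-centered input to within floating-point error, so the variance-dependent rescaling is gone. The mean subtraction of a standard LayerNorm is still present here. How the released weights treat that centering is not covered by this example.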