stefanhex-apollo committed
Update README.md
README.md
CHANGED
@@ -10,7 +10,7 @@ This is a gpt2-small model with LayerNorm fine-tuned out.
 The model was fine-tuned on OpenWebText for ~500M tokens (1000 iterations of batch size ~488 at 1024 context length) while gradually disabling LayerNorm layers.
 For details see [here](https://www.lesswrong.com/posts/THzcKKQd4oWkg4dSP/you-can-remove-gpt2-s-layernorm-by-fine-tuning-for-an-hour) and the upcoming paper.
 
-There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the paper
+There are 5 similar models available (v1 through v5) trained with different fine-tuning schedules. Please refer to the [paper](https://publications.apolloresearch.ai/remove_layer_norm.pdf)
 for details. The best model (v4) is the default as of 6th September 2024 (previously v2 was the default).
 
 The model is a `GPT2LMHeadModel` (to avoid requiring `trust_remote_code`) which technically contains LayerNorm blocks.
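
Since the checkpoint is stored as a standard `GPT2LMHeadModel`, it loads with stock `transformers` and no `trust_remote_code` flag. Below is a minimal loading sketch; the repo id `apollo-research/gpt2_noLN` is an assumption, so substitute the actual repository name for this model card.

```python
# Minimal sketch of loading the LayerNorm-free GPT-2 checkpoint.
# NOTE: the repo id "apollo-research/gpt2_noLN" is an assumption;
# replace it with this model card's actual repository name.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# The checkpoint is a plain GPT2LMHeadModel, so no trust_remote_code is needed.
model = GPT2LMHeadModel.from_pretrained("apollo-research/gpt2_noLN")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # standard GPT-2 tokenizer

# The architecture still exposes the usual LayerNorm modules
# (ln_1, ln_2, ln_f), even though the fine-tuning described above
# is meant to leave them with no meaningful normalization to do.
print(model.transformer.h[0].ln_1)

inputs = tokenizer("After removing LayerNorm,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```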