rskuzma committed
Commit: 68be314
Parent: 099ed6b

add arxiv paper link

Files changed (1): README.md (+2 -1)
README.md CHANGED
@@ -21,6 +21,7 @@ BTLM was trained by [Cerebras](https://www.cerebras.net/) in partnership with [O
 
 BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxiv.org/abs/2304.03208) with the addition of [SwiGLU](https://arxiv.org/abs/2002.05202) nonlinearity, [ALiBi](https://arxiv.org/abs/2108.12409) position embeddings, and [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466). The model was trained for 1 epoch of SlimPajama-627B. 75% of training was performed with 2k sequence length. The final 25% of training was performed at 8k sequence length to enable long sequence applications
 
+Read [our paper](https://arxiv.org/abs/2309.11568) for more details!
 
 ## BTLM-3B-8k Highlights
 
@@ -134,7 +135,7 @@ Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test se
 - Positional Encoding: ALiBi
 - Language: English
 - Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
-- Paper: Coming soon
+- Paper: [BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model](https://arxiv.org/abs/2309.11568)
 
 ## To continue training with PyTorch and Maximal Update Parameterization
 
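
For readers of this commit: the README paragraph touched above names three architectural choices, SwiGLU, ALiBi, and muP. Below is a minimal PyTorch sketch of the first two, following the linked papers; the module names, dimensions, and demo values are illustrative assumptions, not BTLM's actual implementation. muP is omitted because it mainly changes initialization and learning-rate scaling with width rather than module structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU gating of Shazeer (2020).

    Dimensions here are illustrative assumptions, not BTLM's config.
    """

    def __init__(self, d_model: int, d_ff: int) -> None:
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down, where SiLU is Swish with beta = 1
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Additive ALiBi attention bias (Press et al., 2021) for causal attention.

    Uses the paper's slope rule for power-of-two head counts:
    slope_k = 2 ** (-8 * (k + 1) / n_heads).
    """
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, negative for past keys,
    # so attention scores decay linearly with distance (future keys are
    # handled by the usual causal mask, not by this bias)
    distance = pos[None, :] - pos[:, None]
    return slopes[:, None, None] * distance[None, :, :]  # (n_heads, seq_len, seq_len)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)           # (batch, seq, d_model); toy sizes
    print(SwiGLU(64, 172)(x).shape)      # torch.Size([2, 16, 64])
    print(alibi_bias(8, 16).shape)       # torch.Size([8, 16, 16])
```

The slope rule shown assumes a power-of-two number of heads; the ALiBi paper gives an adjusted recipe for other head counts.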