add arXiv paper link
README.md CHANGED
@@ -21,6 +21,7 @@ BTLM was trained by [Cerebras](https://www.cerebras.net/) in partnership with [O
 
 BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxiv.org/abs/2304.03208) with the addition of [SwiGLU](https://arxiv.org/abs/2002.05202) nonlinearity, [ALiBi](https://arxiv.org/abs/2108.12409) position embeddings, and [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466). The model was trained for 1 epoch of SlimPajama-627B. 75% of training was performed with 2k sequence length. The final 25% of training was performed at 8k sequence length to enable long sequence applications.
 
+Read [our paper](https://arxiv.org/abs/2309.11568) for more details!
 
 ## BTLM-3B-8k Highlights
 
@@ -134,7 +135,7 @@ Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test se
 - Positional Encoding: ALiBi
 - Language: English
 - Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
-- Paper:
+- Paper: [BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model](https://arxiv.org/abs/2309.11568)
 
 ## To continue training with PyTorch and Maximal Update Parameterization
 
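The context paragraph in the first hunk attributes BTLM's long-sequence ability to ALiBi position embeddings, which add a distance-proportional penalty to attention scores instead of adding position vectors to the input. As a reader aid, here is a minimal PyTorch sketch of that bias following the formulation in the linked ALiBi paper; it is an illustration of the general technique, not BTLM's actual implementation, and the helper names `alibi_slopes` and `alibi_bias` are ours.

```python
import math

import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes m_h = 2^(-8h / n_heads) from the ALiBi paper
    # (this closed form is exact when n_heads is a power of two).
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Relative distance j - i: 0 on the diagonal and -1, -2, ... for
    # progressively older tokens. The upper (non-causal) triangle is
    # zeroed here; a causal mask hides it anyway.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).tril()   # (seq, seq)
    slopes = alibi_slopes(n_heads)                    # (n_heads,)
    return slopes[:, None, None] * distance          # (n_heads, seq, seq)

# Usage: the bias is simply added to the pre-softmax attention logits.
n_heads, seq_len, d_head = 4, 8, 64                  # toy sizes
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)
scores = q @ k.transpose(-1, -2) / math.sqrt(d_head)
scores = scores + alibi_bias(n_heads, seq_len)
```

Because the bias depends only on the relative distance j - i, nothing in it is tied to a particular training length; that is what lets the mostly-2k training curriculum described above extend to 8k sequences in the final 25% of training and to long contexts at inference.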
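The same paragraph also mentions maximal update parameterization (muP), whose practical payoff is that hyperparameters tuned on a narrow proxy model transfer to the full-width model once per-parameter scaling rules are applied. Below is a minimal sketch of one such rule, the 1/width-multiplier Adam learning-rate scaling for matrix-like (hidden) weights; the widths and base learning rate are made-up illustration values, and the other muP rules (output-logit scaling, initialization variances) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical widths: hyperparameters are tuned once on a narrow
# "proxy" model, then transferred to the wide target model.
base_width, width = 256, 2560
m_width = width / base_width                  # width multiplier

class Block(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.fc1 = nn.Linear(d, 4 * d)
        self.fc2 = nn.Linear(4 * d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))

model = Block(width)
base_lr = 6e-4                                # tuned on the proxy model

# muP rule for Adam-style optimizers: divide the learning rate of
# matrix-like (hidden weight) parameters by the width multiplier;
# vector-like parameters (biases, norms) keep the base rate.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
vector_params = [p for p in model.parameters() if p.ndim < 2]
optimizer = torch.optim.AdamW([
    {"params": matrix_params, "lr": base_lr / m_width},
    {"params": vector_params, "lr": base_lr},
])
```

The README's "To continue training with PyTorch and Maximal Update Parameterization" section presumably gives the full recipe; this sketch only shows why the optimizer there is built with per-group learning rates.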