add arXiv paper link
README.md CHANGED
@@ -21,6 +21,7 @@ BTLM was trained by [Cerebras](https://www.cerebras.net/) in partnership with [O
 
 BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxiv.org/abs/2304.03208) with the addition of [SwiGLU](https://arxiv.org/abs/2002.05202) nonlinearity, [ALiBi](https://arxiv.org/abs/2108.12409) position embeddings, and [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466). The model was trained for 1 epoch of SlimPajama-627B. 75% of training was performed with 2k sequence length. The final 25% of training was performed at 8k sequence length to enable long sequence applications.
 
+Read [our paper](https://arxiv.org/abs/2309.11568) for more details!
 
 ## BTLM-3B-8k Highlights
 
@@ -134,7 +135,7 @@ Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test se
 - Positional Encoding: ALiBi
 - Language: English
 - Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
-- Paper:
+- Paper: [BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model](https://arxiv.org/abs/2309.11568)
 
 ## To continue training with PyTorch and Maximal Update Parameterization
 
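The context paragraph in the first hunk attributes BTLM's long-sequence ability to ALiBi position embeddings, which add a distance-proportional penalty to attention scores instead of adding position vectors to the input. As a reader aid, here is a minimal PyTorch sketch of that bias following the formulation in the linked ALiBi paper; it is an illustration of the general technique, not BTLM's actual implementation, and the helper names `alibi_slopes` and `alibi_bias` are ours.

```python
import math

import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Head-specific slopes m_h = 2^(-8h / n_heads) from the ALiBi paper
    # (this closed form is exact when n_heads is a power of two).
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Relative distance j - i: 0 on the diagonal and -1, -2, ... for
    # progressively older tokens. The upper (non-causal) triangle is
    # zeroed here; a causal mask hides it anyway.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).tril()   # (seq, seq)
    slopes = alibi_slopes(n_heads)                    # (n_heads,)
    return slopes[:, None, None] * distance          # (n_heads, seq, seq)

# Usage: the bias is simply added to the pre-softmax attention logits.
n_heads, seq_len, d_head = 4, 8, 64                  # toy sizes
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)
scores = q @ k.transpose(-1, -2) / math.sqrt(d_head)
scores = scores + alibi_bias(n_heads, seq_len)
```

Because the bias depends only on the relative distance j - i, nothing in it is tied to a particular training length; that is what lets the mostly-2k training curriculum described above extend to 8k sequences in the final 25% of training and to long contexts at inference.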
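The same paragraph also mentions maximal update parameterization (muP), whose practical payoff is that hyperparameters tuned on a narrow proxy model transfer to the full-width model once per-parameter scaling rules are applied. Below is a minimal sketch of one such rule, the 1/width-multiplier Adam learning-rate scaling for matrix-like (hidden) weights; the widths and base learning rate are made-up illustration values, and the other muP rules (output-logit scaling, initialization variances) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical widths: hyperparameters are tuned once on a narrow
# "proxy" model, then transferred to the wide target model.
base_width, width = 256, 2560
m_width = width / base_width                  # width multiplier

class Block(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.fc1 = nn.Linear(d, 4 * d)
        self.fc2 = nn.Linear(4 * d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))

model = Block(width)
base_lr = 6e-4                                # tuned on the proxy model

# muP rule for Adam-style optimizers: divide the learning rate of
# matrix-like (hidden weight) parameters by the width multiplier;
# vector-like parameters (biases, norms) keep the base rate.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
vector_params = [p for p in model.parameters() if p.ndim < 2]
optimizer = torch.optim.AdamW([
    {"params": matrix_params, "lr": base_lr / m_width},
    {"params": vector_params, "lr": base_lr},
])
```

The README's "To continue training with PyTorch and Maximal Update Parameterization" section presumably gives the full recipe; this sketch only shows why the optimizer there is built with per-group learning rates.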