jacobfulano committed
Commit: c66f045
Parent(s): c721a25
Update README.md

README.md CHANGED
@@ -6,14 +6,23 @@ language:
 - en
 ---
 
-# MosaicBERT
-
+# MosaicBERT-Base model
+
+MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining. MosaicBERT-Base achieves higher pretraining and finetuning accuracy
+
+### Model Date
+
+March 2023
+
+## Documentation
+
+* Blog post
+* Github (mosaicml/examples repo)
 
 ## Model description
 
 In order to build MosaicBERT, we adopted architectural choices from the recent transformer literature.
-These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409),
-
+These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409),
+and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202). In addition, we remove padding inside the transformer block,
+and apply LayerNorm with low precision.
 
 ### Modifications to the Attention Mechanism
 1. **FlashAttention**: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer
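The added text names the main architectural ingredients but does not show them. Below is a minimal, self-contained PyTorch sketch of three of them (symmetric ALiBi attention biases, a gated-linear-unit feed-forward block, and LayerNorm run in low precision) to make the ideas concrete. It is an illustration under simplified assumptions, not MosaicBERT's actual implementation; the shapes (12 heads, 768-dim model) are BERT-Base-like placeholders, and the class/function names are invented for this sketch.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Symmetric ALiBi bias for bidirectional attention, shape (n_heads, seq_len, seq_len)."""
    # Simple power-of-two slope recipe 2^(-8(h+1)/n); Press et al. 2021 use a slightly
    # different interleaving when n_heads is not a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()   # |i - j|
    return -slopes[:, None, None] * distance[None, :, :]     # linear penalty on distance


class GLUFeedForward(nn.Module):
    """Transformer feed-forward block with a GELU-gated linear unit (Shazeer 2020)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.gate(x)) * self.up(x))


class LowPrecisionLayerNorm(nn.LayerNorm):
    """LayerNorm computed in bfloat16, with the output cast back to the input dtype."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.layer_norm(
            x.to(torch.bfloat16),
            self.normalized_shape,
            self.weight.to(torch.bfloat16),
            self.bias.to(torch.bfloat16),
            self.eps,
        )
        return y.to(x.dtype)


if __name__ == "__main__":
    n_heads, seq_len, d_head, d_model = 12, 128, 64, 768
    q = torch.randn(1, n_heads, seq_len, d_head)
    k = torch.randn(1, n_heads, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    scores = scores + alibi_bias(n_heads, seq_len)            # add ALiBi before softmax
    attn = scores.softmax(dim=-1)

    x = torch.randn(1, seq_len, d_model)
    x = LowPrecisionLayerNorm(d_model)(x + GLUFeedForward(d_model, 3072)(x))
    print(attn.shape, x.shape)
```

The fourth change mentioned above, removing padding inside the transformer block, is a throughput optimization (operating only on real tokens rather than padded positions) and is not sketched here.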