jacobfulano committed 29c1999 (1 parent: 24512df): Update README.md

README.md CHANGED

datasets:
- c4
language:
- en
---

# MosaicBERT base model

Our goal in developing MosaicBERT was to greatly reduce pretraining time.

## Model description

To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner, low-precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).
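
Of these changes, the Gated Linear Unit feed-forward block is the easiest to show in isolation. The sketch below is a generic GEGLU-style block in PyTorch written only for illustration; it is not the module used in MosaicBERT, and the hidden sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Feed-forward block with a Gated Linear Unit (GEGLU variant, Shazeer 2020).
    Illustrative only; the hidden sizes are arbitrary placeholders."""
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden)  # branch that is passed through the activation
        self.up_proj = nn.Linear(d_model, d_hidden)    # branch that gets gated elementwise
        self.down_proj = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GLU: elementwise product of an activated "gate" branch and a plain linear branch.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# Example: a batch of 2 sequences, 8 tokens each, hidden size 768.
out = GLUFeedForward()(torch.randn(2, 8, 768))
```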

1. Modifications to the Attention Mechanism

FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high-bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).
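
The Triton kernel itself is not shown here. As a rough stand-in for the same idea, PyTorch 2.x exposes a fused `scaled_dot_product_attention` that can dispatch to a FlashAttention-style kernel on supported GPUs; the snippet below only illustrates fused attention, it is not the Hazy Research module referenced above, and the tensor shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: (batch, heads, seq_len, head_dim).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = torch.randn(8, 12, 128, 64, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On recent GPUs this call can run as a fused FlashAttention-style kernel, so the
# full (seq_len x seq_len) attention matrix never has to be written out to HBM.
out = F.scaled_dot_product_attention(q, k, v)
```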

## How to use
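
A minimal usage sketch with the transformers library follows. The repository id, the choice of the standard bert-base-uncased tokenizer, and the `trust_remote_code=True` flag are assumptions made for illustration rather than details taken from this card.

```python
# Hedged sketch: the checkpoint id, tokenizer choice, and trust_remote_code flag
# are assumptions, not taken from this README.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base",  # hypothetical repository id
    trust_remote_code=True,       # assumes the checkpoint ships custom modeling code
)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("The capital of France is [MASK]."))
```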

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. MosaicBERT is trained on the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora like English Wikipedia and BooksCorpus.
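
For reference, the English split of C4 can be streamed with the Hugging Face datasets library. The hub id `allenai/c4` below is an assumption about where the data is hosted; this is not the pretraining data pipeline used for MosaicBERT.

```python
from datasets import load_dataset

# Stream the English C4 split rather than downloading the full corpus up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Peek at a few documents.
for example in c4.take(3):
    print(example["text"][:100])
```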

## Training procedure

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      |             |      |      |       |      |       |      |      |         |
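
For orientation only, a GLUE fine-tuning run with the transformers Trainer might look like the sketch below. The checkpoint id, the `trust_remote_code=True` flag, the assumption that the checkpoint exposes a sequence-classification head through the Auto classes, and all hyperparameters are placeholders, not the settings behind any reported numbers.

```python
# Hypothetical fine-tuning sketch on SST-2; checkpoint id, trust_remote_code usage,
# and hyperparameters are assumptions, not taken from this README.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "mosaicml/mosaic-bert-base", num_labels=2, trust_remote_code=True
)

sst2 = load_dataset("glue", "sst2")
encoded = sst2.map(lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mosaicbert-sst2",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```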

## Intended uses & limitations