Create README.md
README.md
ADDED
@@ -0,0 +1,54 @@
---
language: en

---

# OLM RoBERTa/BERT October 2022

This is a more up-to-date version of the [original BERT](https://huggingface.co/bert-base-cased) and [original RoBERTa](https://huggingface.co/roberta-base).
Besides being more up-to-date, it tends to perform better than the original BERT on standard benchmarks.
We think it is fairer to compare our model directly to the original BERT, because our model was trained with about the same amount of compute and the BERT and RoBERTa architectures are essentially the same.
The original RoBERTa takes an order of magnitude more compute, although our model's performance on standard benchmarks is not far from RoBERTa's.
Our model was trained on a cleaned October 2022 snapshot of Common Crawl and Wikipedia.

This model was created as part of the OLM project, whose goal is to continuously train and release models that are up-to-date and comparable in standard language model performance to their static counterparts.
This is important because we want our models to know about events like COVID or a presidential election right after they happen.

## Intended uses

You can use the raw model for masked language modeling or fine-tune it on a downstream task.

## How to use

TODO
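
Until this section is filled in, a minimal sketch of masked-token prediction with the `transformers` fill-mask pipeline could look like the following. The repository ID `olm/olm-roberta-base-oct-2022` is an assumption (this README does not state it), as is the RoBERTa-style `<mask>` token; the helper names `mask_word` and `top_predictions` are illustrative only.

```python
# Assumed Hub repository ID; this README does not state it, so substitute the real one.
MODEL_ID = "olm/olm-roberta-base-oct-2022"
# RoBERTa-style tokenizers use "<mask>"; check tokenizer.mask_token if unsure.
MASK_TOKEN = "<mask>"


def mask_word(text: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def top_predictions(masked_text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    Downloads the model on first use, so it needs network access.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    fill_mask = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked_text, top_k=top_k)]
```

Calling `top_predictions(mask_word("The capital of France is Paris.", "Paris"))` would then return the five highest-scoring candidates for the masked word, provided the assumed repository ID resolves.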

## Dataset

The model and tokenizer were trained with this [October 2022 cleaned Common Crawl dataset](https://huggingface.co/datasets/olm/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295) plus this [October 2022 cleaned Wikipedia dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221001).\
The tokenized version of these concatenated datasets is [here](https://huggingface.co/datasets/olm/olm-october-2022-tokenized-512).\
The datasets were created with this [repo](https://github.com/huggingface/olm-datasets).
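
To inspect these datasets without downloading them in full, a sketch along the following lines may help. The dataset IDs are taken from the links above; the `train` split name and the helper name `peek` are assumptions, and the record fields are not documented here, so the records are returned as-is.

```python
from itertools import islice

# Dataset IDs taken from the links in this section.
DATASET_IDS = {
    "common_crawl": "olm/olm-CC-MAIN-2022-40-sampling-ratio-0.15894621295",
    "wikipedia": "olm/olm-wikipedia-20221001",
    "tokenized": "olm/olm-october-2022-tokenized-512",
}


def peek(name: str, n: int = 3, split: str = "train"):
    """Return the first `n` records of one dataset without a full download.

    The "train" split name is an assumption; adjust it if the dataset differs.
    """
    from datasets import load_dataset  # imported lazily: needs network access

    # streaming=True yields records on demand instead of downloading everything.
    stream = load_dataset(DATASET_IDS[name], split=split, streaming=True)
    return list(islice(stream, n))
```

For instance, `peek("wikipedia", n=1)` would fetch a single record, which is enough to see which fields the dataset exposes.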

## Training

The model was trained according to the OLM BERT/RoBERTa instructions at this [repo](https://github.com/huggingface/olm-training).

## Evaluation results

The model achieves the following results after being fine-tuned on GLUE tasks:

| Task | Metric | Original BERT | OLM RoBERTa Oct 2022 (Ours) |
|:-----|:-------|--------------:|----------------------------:|
| cola | mcc | **0.5889** | 0.2340 |
| sst2 | acc | 0.9181 | **0.9305** |
| mrpc | acc/f1 | **0.9182**/0.8923 | 0.8828/**0.9149** |
| stsb | pearson/spearman | 0.8822/0.8794 | **0.8943**/**0.8934** |
| qqp | acc/f1 | 0.9071/0.8748 | **0.9094**/**0.8781** |
| mnli | acc matched/mismatched | 0.8400/0.8410 | **0.8599**/**0.8622** |
| qnli | acc | 0.9075 | **0.9148** |
| rte | acc | **0.6296** | 0.6253 |
| wnli | acc | 0.4000 | **0.5042** |

For both the original BERT and our model, we used the Hugging Face `run_glue.py` script [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification).
For both models, we used the default fine-tuning hyperparameters and averaged the results over five training seeds.
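
The evaluation procedure can be sketched roughly as follows: build one `run_glue.py` command per seed, leaving every other hyperparameter at its default, then average each metric over the seeds. The seed values, the output directory layout, and the helper names are assumptions; the README only states that five seeds were used.

```python
from statistics import mean

# Assumption: the README says five seeds were used but not which ones.
SEEDS = [0, 1, 2, 3, 4]


def run_glue_command(model_id: str, task: str, seed: int, output_root: str = "glue_runs"):
    """Build one run_glue.py invocation that leaves other hyperparameters at their defaults."""
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model_id,
        "--task_name", task,
        "--do_train",
        "--do_eval",
        "--seed", str(seed),
        # Hypothetical directory layout: one folder per task and seed.
        "--output_dir", f"{output_root}/{task}/seed_{seed}",
    ]


def average_over_seeds(scores_by_seed):
    """Average one metric over seeds; `scores_by_seed` maps seed -> score."""
    return mean(scores_by_seed.values())
```

Each command list can be passed to `subprocess.run` from a checkout of the script's directory, and the per-seed evaluation scores collected afterwards can be fed to `average_over_seeds`.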