---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-edu
language:
- en
tags:
- fineweb-lms
- bert
- token-dropping
---

# FineWeb-LMs: Token Dropping BERT

<p align="left">
<picture>
<img alt="BERT with TensorFlow Model Garden" src="https://github.com/stefan-it/model-garden-lms/raw/main/bert_tf_model_garden.png" style="max-width: 25%;">
</picture>
<br/>
</p>

This repository presents a Token Dropping BERT model that was pretrained on the 10BT subsets of [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

# Pretraining Details

The released BERT model is part of my [TensorFlow Model Garden LMs](https://github.com/stefan-it/model-garden-lms/tree/main) project.

The pretraining was done on a v3-32 TPU VM Pod, provided by the amazing [TRC program](https://sites.research.google/trc/about/). Detailed cheatsheets are available:

* [TPU VM Setup](https://github.com/stefan-it/model-garden-lms/tree/main/cheatsheet)
* [Pretraining a Token Dropping BERT Model with TensorFlow Model Garden Library](https://github.com/stefan-it/model-garden-lms/blob/main/token-dropping-bert)

tl;dr: The model was pretrained for 1M steps with a global batch size of 512, a sequence length of 512, and a vocabulary size of 64k.
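
To put these numbers in perspective, the following minimal sketch multiplies the stated steps, batch size, and sequence length. It assumes every sequence is filled to the full 512 positions, so the result is an upper bound on input token positions (padding included) and says nothing about the number of unique tokens. Token Dropping skips some tokens in the intermediate layers, so the actual compute per step is lower than this count suggests. The helper name `token_positions_seen` is made up for this illustration.

```python
# Values stated in the tl;dr above.
TRAIN_STEPS = 1_000_000
GLOBAL_BATCH_SIZE = 512
SEQ_LENGTH = 512


def token_positions_seen(steps: int, batch_size: int, seq_length: int) -> int:
    """Upper bound on input token positions fed to the model during pretraining.

    Assumes every sequence uses all `seq_length` positions (padding counts too).
    """
    return steps * batch_size * seq_length


total = token_positions_seen(TRAIN_STEPS, GLOBAL_BATCH_SIZE, SEQ_LENGTH)
print(f"{total:,} token positions (~{total / 1e9:.0f}B)")
# prints: 262,144,000,000 token positions (~262B)
```
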

# Checkpoint Evaluation with ScandEval

We evaluate the last 5 checkpoints (1M, 951k, 901k, 851k and 801k) with a recent version of ScandEval to check their performance and compare them with popular encoder-only models such as BERT, RoBERTa and ELECTRA:

| Model ID                                                                                                                                  | Avg. Score   | CoNLL-En                    | SST5                        | ScaLA-En                    | SQuAD                       |
|-------------------------------------------------------------------------------------------------------------------------------------------|--------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
| [model-garden-lms/bert-base-token-dropping-finewebs-1m](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-1m)     | 67.66        | 88.68 ± 0.76 / 88.47 ± 0.62 | 57.4 ± 1.7 / 59.61 ± 1.6    | 52.72 ± 5.13 / 73.6 ± 4.42  | 55.04 ± 1.54 / 65.72 ± 1.75 |
| [model-garden-lms/bert-base-token-dropping-finewebs-951k](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-951k) | 66.87        | 88.81 ± 0.68 / 88.64 ± 0.54 | 57.44 ± 1.39 / 56.85 ± 2.09 | 50.91 ± 5.08 / 72.22 ± 4.2  | 54.63 ± 1.3 / 65.43 ± 1.43  |
| [model-garden-lms/bert-base-token-dropping-finewebs-901k](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-901k) | **68.01**    | 88.98 ± 0.64 / 88.67 ± 0.55 | 57.79 ± 1.31 / 58.91 ± 1.85 | 54.25 ± 6.3 / 75.73 ± 3.54  | 54.4 ± 0.72 / 65.31 ± 1.01  |
| [model-garden-lms/bert-base-token-dropping-finewebs-851k](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-851k) | 67.97        | 88.9 ± 0.7 / 88.81 ± 0.54   | 58.0 ± 1.02 / 58.73 ± 1.8   | 54.04 ± 2.61 / 74.89 ± 2.07 | 54.75 ± 1.08 / 65.66 ± 1.26 |
| [model-garden-lms/bert-base-token-dropping-finewebs-801k](https://huggingface.co/model-garden-lms/bert-base-token-dropping-finewebs-801k) | 67.80        | 88.95 ± 0.7 / 88.73 ± 0.58  | 57.71 ± 1.43 / 60.5 ± 1.69  | 50.95 ± 6.3 / 74.16 ± 3.2   | 55.24 ± 1.37 / 66.13 ± 1.24 |
| [google-bert/bert-base-cased](https://huggingface.co/google-bert/bert-base-cased)                                                         | 62.26        | 87.39 ± 0.79 / 87.11 ± 0.66 | 54.49 ± 1.36 / 53.22 ± 1.15 | 52.08 ± 2.13 / 74.52 ± 1.31 | 38.63 ± 2.1 / 50.68 ± 1.87  |
| [google/electra-base-discriminator](https://huggingface.co/google/electra-base-discriminator)                                             | 69.26        | 87.82 ± 0.69 / 86.83 ± 0.62 | 62.3 ± 1.12 / 55.93 ± 0.67  | 62.61 ± 1.21 / 80.85 ± 0.59 | 52.51 ± 0.86 / 65.2 ± 0.85  |
| [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base)                                                                 | 68.96        | 90.35 ± 0.23 / 90.14 ± 0.2  | 60.95 ± 1.4 / 57.52 ± 1.97  | 50.64 ± 1.69 / 74.55 ± 0.9  | 57.82 ± 1.35 / 69.68 ± 1.02 |

Our pretrained Token Dropping BERT model shows a strong performance improvement over the original BERT model: all five checkpoints reach an average score between 66.87 and 68.01, compared with 62.26 for `google-bert/bert-base-cased`. On average it remains slightly behind ELECTRA (69.26) and RoBERTa (68.96). All detailed results can be found in [this](https://huggingface.co/datasets/model-garden-lms/finewebs-scandeval-results) dataset repository.
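
As a quick sanity check on the table, the sketch below recomputes the average score for the 1M checkpoint and the three baselines from the point estimates in each row. The reported "Avg. Score" values are consistent with a plain mean over the eight metric values per model. This is an observation from the table only, not a description of how ScandEval aggregates scores internally, and the `± ` spreads are ignored here.

```python
from statistics import mean

# Point estimates copied from the table above (± spreads omitted).
# Order per model: CoNLL-En, SST5, ScaLA-En, SQuAD, two values each.
scores = {
    "bert-base-token-dropping-finewebs-1m": [88.68, 88.47, 57.4, 59.61, 52.72, 73.6, 55.04, 65.72],
    "bert-base-cased": [87.39, 87.11, 54.49, 53.22, 52.08, 74.52, 38.63, 50.68],
    "electra-base-discriminator": [87.82, 86.83, 62.3, 55.93, 62.61, 80.85, 52.51, 65.2],
    "roberta-base": [90.35, 90.14, 60.95, 57.52, 50.64, 74.55, 57.82, 69.68],
}

# "Avg. Score" column as reported in the table.
reported = {
    "bert-base-token-dropping-finewebs-1m": 67.66,
    "bert-base-cased": 62.26,
    "electra-base-discriminator": 69.26,
    "roberta-base": 68.96,
}

# Plain mean over all eight values of each row.
averages = {name: mean(values) for name, values in scores.items()}

for name, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:40s} recomputed={avg:.3f} reported={reported[name]:.2f}")

# Gap between the 1M checkpoint and the original BERT baseline.
gap = averages["bert-base-token-dropping-finewebs-1m"] - averages["bert-base-cased"]
print(f"Gain over bert-base-cased: {gap:.2f} points")
```
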

# ❤️ Acknowledgements

This repository is the outcome of the last two years of working with TPUs from the awesome [TRC program](https://sites.research.google/trc/about/) and the [TensorFlow Model Garden](https://github.com/tensorflow/models) library.

Made from Bavarian Oberland with ❤️ and 🥨.