AraBERTv0.2-Twitter
AraBERTv0.2-Twitter-base/large are two new models for Arabic dialects and tweets, trained by continuing the pre-training with the MLM task on ~60M Arabic tweets (filtered from a collection of 100M).
Emojis have been added to the vocabulary of both models, along with common words that were not originally present. Pre-training was run for only 1 epoch with a maximum sentence length of 64.
AraBERT is an Arabic pretrained language model based on Google's BERT architecture. AraBERT uses the same BERT-Base config. More details are available in the AraBERT Paper and in the AraBERT Meetup.
Other Models
Model | HuggingFace Model Name | Size (MB/Params) | Pre-Segmentation | DataSet (Sentences/Size/nWords) |
---|---|---|---|---|
AraBERTv0.2-base | bert-base-arabertv02 | 543MB / 136M | No | 200M / 77GB / 8.6B |
AraBERTv0.2-large | bert-large-arabertv02 | 1.38G / 371M | No | 200M / 77GB / 8.6B |
AraBERTv2-base | bert-base-arabertv2 | 543MB / 136M | Yes | 200M / 77GB / 8.6B |
AraBERTv2-large | bert-large-arabertv2 | 1.38G / 371M | Yes | 200M / 77GB / 8.6B |
AraBERTv0.1-base | bert-base-arabertv01 | 543MB / 136M | No | 77M / 23GB / 2.7B |
AraBERTv1-base | bert-base-arabert | 543MB / 136M | Yes | 77M / 23GB / 2.7B |
AraBERTv0.2-Twitter-base | bert-base-arabertv02-twitter | 543MB / 136M | No | Same as v02 + 60M Multi-Dialect Tweets |
AraBERTv0.2-Twitter-large | bert-large-arabertv02-twitter | 1.38G / 371M | No | Same as v02 + 60M Multi-Dialect Tweets |
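The Size column can be roughly sanity-checked against the parameter counts. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter), which the table does not state; `checkpoint_size_bytes` is a hypothetical helper, and the parameter counts are the rounded figures from the table.

```python
def checkpoint_size_bytes(n_params, bytes_per_param=4):
    """Approximate weight size, assuming every parameter is a float32 (assumption)."""
    return n_params * bytes_per_param


base = checkpoint_size_bytes(136_000_000)   # rounded count for the base models
large = checkpoint_size_bytes(371_000_000)  # rounded count for the large models

print(f"base:  {base / 1e6:.0f} MB")      # decimal megabytes -> 544 MB
print(f"large: {large / 2**30:.2f} GiB")  # binary gibibytes  -> 1.38 GiB
```

Under that assumption, the estimate for the base models is close to the listed 543MB, and the estimate for the large models matches 1.38G when read as binary gigabytes.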
Preprocessing
The model is trained with a sequence length of 64; using a max length beyond 64 might result in degraded performance.
It is recommended to apply our preprocessing function before training/testing on any dataset. The preprocessor will keep and space out emojis when used with a "twitter" model.
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_name="aubmindlab/bert-base-arabertv02-twitter"
arabert_prep = ArabertPreprocessor(model_name=model_name)
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
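As a minimal sketch of how the 64-token recommendation might be applied in practice, the helper below passes `truncation=True` and `max_length=64` to a Hugging Face tokenizer, and a second helper queries the model through the `fill-mask` pipeline from `transformers`. The names `MAX_SEQ_LEN`, `encode_for_twitter_model` and `top_mask_predictions` are hypothetical and not part of AraBERT; the second helper downloads the model when called.

```python
MAX_SEQ_LEN = 64  # sequence length used when pre-training the Twitter models


def encode_for_twitter_model(tokenizer, text, max_length=MAX_SEQ_LEN):
    """Tokenize preprocessed text, truncating to max_length tokens
    (special tokens such as [CLS] and [SEP] count toward the limit)."""
    return tokenizer(text, truncation=True, max_length=max_length)


def top_mask_predictions(model_name, masked_text, top_k=5):
    """Return (token, score) pairs for the single [MASK] in masked_text.
    Requires transformers and network access to fetch the model."""
    from transformers import pipeline  # imported lazily: only needed here

    fill_mask = pipeline("fill-mask", model=model_name)
    return [(p["token_str"], p["score"]) for p in fill_mask(masked_text, top_k=top_k)]
```

For example, `encode_for_twitter_model(tokenizer, arabert_prep.preprocess(text))` would return input IDs capped at 64 tokens, and `top_mask_predictions("aubmindlab/bert-base-arabertv02-twitter", ...)` expects a sentence containing exactly one `[MASK]` token.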
If you use this model, please cite us as follows:
Google Scholar has our BibTeX wrong (missing name); use this instead:
@inproceedings{antoun2020arabert,
title={AraBERT: Transformer-based Model for Arabic Language Understanding},
author={Antoun, Wissam and Baly, Fady and Hajj, Hazem},
booktitle={LREC 2020 Workshop Language Resources and Evaluation Conference 11--16 May 2020},
pages={9}
}
Acknowledgments
Thanks to TensorFlow Research Cloud (TFRC) for the free access to Cloud TPUs (we couldn't have done it without this program), and to the AUB MIND Lab members for their continuous support. Thanks also to Yakshof and Assafir for data and storage access, and to Habib Rahal (https://www.behance.net/rahalhabib) for putting a face to AraBERT.
Contacts
Wissam Antoun: Linkedin | Twitter | Github | [email protected] | [email protected]
Fady Baly: Linkedin | Twitter | Github | [email protected] | [email protected]