arxiv:2101.00204

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Published on Jan 1, 2021

Abstract

In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.
