|
--- |
|
annotations_creators: |
|
- crowdsourced |
|
language: |
|
- amh |
|
- orm |
|
- lin |
|
- hau |
|
- ibo |
|
- kin |
|
- lug |
|
- luo |
|
- pcm |
|
- swa |
|
- wol |
|
- yor |
|
- bam |
|
- bbj |
|
- ewe |
|
- fon |
|
- mos |
|
- nya |
|
- sna |
|
- tsn |
|
- twi |
|
- xho |
|
- zul |
|
language_creators: |
|
- crowdsourced |
|
license: |
|
- cc-by-4.0 |
|
multilinguality: |
|
- monolingual |
|
pretty_name: afrolm-dataset |
|
size_categories: |
|
- 1M<n<10M |
|
source_datasets: |
|
- original |
|
tags: |
|
- afrolm |
|
- active learning |
|
- language modeling |
|
- research papers |
|
- natural language processing |
|
- self-active learning |
|
task_categories: |
|
- fill-mask |
|
task_ids: |
|
- masked-language-modeling |
|
--- |
|
# AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages |
|
- [GitHub Repository of the Paper](https://github.com/bonaventuredossou/MLM_AL) |
|
|
|
This repository accompanies our paper [`AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages`](https://arxiv.org/pdf/2211.03263.pdf), which appeared at the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) at EMNLP 2022.
|
|
|
## Our self-active learning framework |
|
![Model](afrolm.png) |
|
|
|
## Languages Covered |
|
AfroLM has been pretrained from scratch on 23 African Languages: Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu. |
|
|
|
## Evaluation Results |
|
AfroLM was evaluated on the MasakhaNER1.0 (10 African languages) and MasakhaNER2.0 (21 African languages) datasets, as well as on text classification and sentiment analysis tasks. AfroLM outperformed AfriBERTa, mBERT, and XLMR-base, and was very competitive with AfroXLMR. AfroLM is also very data efficient: it was pretrained on a dataset more than 14x smaller than those of its competitors. The table below reports the average F1-score of each model on each dataset; please consult our paper for per-language results.
|
|
|
| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| `AfroLM-Large` | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** |
| `AfriBERTa` | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| `mBERT` | 71.55 | 80.68 | --- | --- | --- |
| `XLMR-base` | 79.16 | 83.09 | --- | --- | --- |
| `AfroXLMR-base` | `81.90` | `84.55` | --- | --- | --- |
|
|
|
- (*) The evaluation was performed on the 11 additional languages of the dataset.

- Bold numbers mark the performance of the model pretrained on the smallest amount of data.
|
## Pretrained Models and Dataset |
|
|
|
**Model:** [AfroLM-Large](https://huggingface.co/bonadossou/afrolm_active_learning) and **Dataset:** [AfroLM Dataset](https://huggingface.co/datasets/bonadossou/afrolm_active_learning_dataset)
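
As a quick sketch (not part of the official instructions), the dataset can be loaded with the `datasets` library; the split name below is an assumption, and a per-language configuration may be required, so check the dataset card for the exact names:

```python
# Minimal sketch: load the AfroLM pretraining dataset from the Hugging Face Hub.
# The split name "train" is an assumption; a per-language configuration name may
# also be required. Consult the dataset card for the actual configurations/splits.
from datasets import load_dataset

dataset = load_dataset("bonadossou/afrolm_active_learning_dataset", split="train")
print(dataset)     # features and number of rows
print(dataset[0])  # first example
```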
|
|
|
## Hugging Face usage of AfroLM-Large
|
```python |
|
from transformers import XLMRobertaModel, XLMRobertaTokenizer |
|
model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning") |
|
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning") |
|
# Cap the tokenizer's maximum sequence length at 256
tokenizer.model_max_length = 256
|
``` |
|
The `AutoTokenizer` class does not load our tokenizer correctly, so we recommend using the `XLMRobertaTokenizer` class directly. Depending on your task, load the corresponding head of the model; see the [XLMRoberta documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta) and the sketch below.
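
For instance, here is a minimal fill-mask sketch, assuming the checkpoint ships with its masked-language-modeling head; the input sentence is a placeholder, not an example from the dataset:

```python
# Minimal fill-mask sketch for AfroLM-Large (assumes the MLM head is available
# in the checkpoint). Replace the placeholder text with a sentence in one of
# the 23 covered languages.
import torch
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

model = XLMRobertaForMaskedLM.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256

text = f"Your sentence with a {tokenizer.mask_token} token goes here."  # placeholder
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position(s) and take the highest-scoring prediction for each
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```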
|
|
|
## Reproducing our results: Training and Evaluation
|
|
|
- To train the network, run `python active_learning.py`. You can also wrap it in a `bash` script.
|
- For the evaluation: |
|
- NER Classification: `bash ner_experiments.sh` |
|
- Text Classification & Sentiment Analysis: `bash text_classification_all.sh` |
|
|
|
|
|
## Citation |
|
|
|
```bibtex
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P. and
      Tonja, Atnafu Lambebo and
      Yousuf, Oreen and
      Osei, Salomey and
      Oppong, Abigail and
      Shode, Iyanuoluwa and
      Awoyomi, Oluwabusayo Olufunke and
      Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64",
}
```
|
|
|
If you like our work, please cite it and give the repository a star.
|
|
|
## Reach out |
|
|
|
Do you have a question? Please create an issue, and we will respond as soon as possible.