xlm-roberta-base-focus-extend-kiswahili

XLM-R adapted to Kiswahili using "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models".

Code: https://github.com/konstantinjdobler/focus

Paper: https://arxiv.org/abs/2305.14481

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("konstantindobler/xlm-roberta-base-focus-extend-kiswahili")
model = AutoModelForMaskedLM.from_pretrained("konstantindobler/xlm-roberta-base-focus-extend-kiswahili")

# Use model and tokenizer as usual
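
As one illustration of "as usual", the checkpoint can be queried through the standard fill-mask pipeline from transformers. This is a minimal sketch, not part of the original card: the helper name predict_masked_token and the example sentence are assumptions made for illustration. XLM-R tokenizers use <mask> as the mask token.

```python
MODEL_ID = "konstantindobler/xlm-roberta-base-focus-extend-kiswahili"
MASK_TOKEN = "<mask>"  # mask token used by XLM-R tokenizers


def predict_masked_token(text, top_k=5):
    """Return (token, score) pairs for the single <mask> in `text`.

    Hypothetical helper for illustration only.
    """
    if text.count(MASK_TOKEN) != 1:
        raise ValueError(f"text must contain exactly one {MASK_TOKEN} token")

    # Imported lazily so that the input check above works without transformers.
    from transformers import pipeline

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill_mask(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

A call might look like predict_masked_token("Nairobi ni mji mkuu wa <mask>."), which returns the top_k candidate tokens with their scores.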

Details

The model is based on xlm-roberta-base and was adapted to Kiswahili. The original multilingual tokenizer was extended with the top 30k tokens of a language-specific Kiswahili tokenizer. The new embeddings were initialized with FOCUS. The model was then trained on data from CC100 for 390k optimizer steps. More details and hyperparameters can be found in the paper.
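
The initialization step can be pictured roughly as follows. This is a simplified sketch of the idea described in the paper, not the authors' implementation (see the linked repository for that): each newly added token receives a sparse weighted combination of the pretrained embeddings of tokens already in the vocabulary, with weights derived from similarities in an auxiliary embedding space. The function names, the toy vectors, and the use of plain Python lists are all assumptions made for illustration.

```python
import math


def sparsemax(scores):
    """Project scores onto the probability simplex; many weights become exactly 0."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(ordered, start=1):
        cumsum += z_k
        if 1 + k * z_k > cumsum:  # position k is still inside the support
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def init_new_embedding(new_aux, known_aux, known_emb):
    """Initialize one new token from tokens that already have pretrained embeddings.

    new_aux:   auxiliary vector of the new token (hypothetical toy input)
    known_aux: auxiliary vectors of the already-known tokens
    known_emb: pretrained model embeddings of the already-known tokens
    """
    weights = sparsemax([cosine(new_aux, aux) for aux in known_aux])
    dim = len(known_emb[0])
    return [sum(w * emb[d] for w, emb in zip(weights, known_emb)) for d in range(dim)]


# Toy example: three known tokens with 2-d auxiliary and 3-d model embeddings.
known_aux = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
known_emb = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
new_embedding = init_new_embedding([1.0, 0.0], known_aux, known_emb)
```

In this toy case the new token's auxiliary vector coincides with that of the first known token, so sparsemax puts all weight on it and the new embedding equals that token's pretrained embedding.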

Disclaimer

The web-scale dataset used for pretraining and tokenizer training (CC100) might contain personal and sensitive information, which the model may reproduce. Such behavior needs to be assessed carefully before any real-world deployment of the model.

Citation

Please cite FOCUS as follows:

@misc{dobler-demelo-2023-focus,
    title={FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models},
    author={Konstantin Dobler and Gerard de Melo},
    year={2023},
    eprint={2305.14481},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 296M params (Safetensors, tensor type F32)