---
license: apache-2.0
datasets:
  - Hailay/TigQA
  - masakhane/masakhaner2
language:
  - am
  - ti
metrics:
  - accuracy
  - f1
base_model:
  - FacebookAI/xlm-roberta-base
pipeline_tag: zero-shot-classification
library_name: transformers
---

# XLM-R and EXLMR Model

We introduce the EXLMR model, an extension of XLM-R that expands the tokenizer vocabulary to incorporate new languages and alleviate out-of-vocabulary (OOV) issues. We initialize the embeddings for the newly added vocabulary in a way that allows the model to leverage the new tokens effectively. Our approach not only benefits low-resource languages but also improves performance on the high-resource languages that were part of the original XLM-R model.
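As a rough illustration of what vocabulary extension involves, the sketch below adds new tokens to an XLM-R tokenizer, resizes the embedding matrix, and initializes the new rows with the mean of the existing embeddings. This is only one common initialization strategy and is not necessarily the exact procedure used for EXLMR; the `new_tokens` list is a made-up example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative only: extend an XLM-R tokenizer with new tokens and
# initialize their embeddings. EXLMR's actual procedure may differ.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-base")

new_tokens = ["ትግርኛ", "ኦሮምኛ"]  # hypothetical new vocabulary entries
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix to make room for the added tokens
model.resize_token_embeddings(len(tokenizer))

# Initialize the new rows with the mean of the pre-existing embeddings
with torch.no_grad():
    embeddings = model.get_input_embeddings().weight
    mean_embedding = embeddings[:-num_added].mean(dim=0)
    embeddings[-num_added:] = mean_embedding
```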

## Model Overview

XLM-R (Cross-lingual Language Model - RoBERTa) is a multilingual model trained on 100 languages. EXLMR (Extended XLM-RoBERTa) builds on it to improve performance on low-resource languages spoken in Ethiopia, including Amharic, Tigrinya, and Afaan Oromo.

## Model Details

- **Base Model:** XLM-R
- **Extended Version:** EXLMR
- **Languages Supported:** Amharic, Tigrinya, Afaan Oromo, and more
- **Training Data:** Trained on a large multilingual corpus

## Usage

EXLMR addresses tokenization issues inherent to the XLM-R model, such as out-of-vocabulary (OOV) tokens and over-tokenization, especially for low-resource languages. Fine-tuning on specific datasets will help adapt the model to particular tasks and improve its performance. You can use this model with the transformers library for various NLP tasks.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Define the model checkpoint
checkpoint = "Hailay/EXLMR"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
```
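
As a quick smoke test, the snippet below (reusing the tokenizer, model, and `torch` import from above) tokenizes a sample sentence and runs a forward pass. The sentence and the meaning of the predicted label are illustrative only, since the label mapping depends on how the classification head was fine-tuned.

```python
# Illustrative forward pass; the label meaning depends on the
# classification head this checkpoint was fine-tuned with.
text = "ሰላም ዓለም"  # example sentence in Ge'ez script
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

predicted_class_id = outputs.logits.argmax(dim=-1).item()
print(predicted_class_id)
```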


EXLMR is designed to support underrepresented languages, particularly those spoken in Ethiopia (such as Amharic, Tigrinya, and Afaan Oromo). Like XLM-RoBERTa, EXLMR can be fine-tuned to handle multiple languages simultaneously, making it effective for cross-lingual tasks such as machine translation, multilingual text classification, and question answering. EXLMR-base follows the same architecture as RoBERTa-base, with 12 layers, a hidden size of 768, and 12 attention heads, totaling approximately 270M parameters.
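
For fine-tuning, a minimal sketch using the `Trainer` API is shown below. The dataset name, column names, label count, and hyperparameters are placeholders to be adapted to your own labeled data and task.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "Hailay/EXLMR"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# "your_dataset" is a placeholder for any labeled text-classification
# dataset with "text" and "label" columns.
dataset = load_dataset("your_dataset")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="exlmr-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
)
trainer.train()
```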

| Model | Vocabulary Size |
|---|---|
| XLM-RoBERTa | 250002 |
| EXLMR | 280147 |
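
To verify the extended vocabulary locally, you can compare the two tokenizers directly; assuming both checkpoints download successfully, the printed sizes should match the table above.

```python
from transformers import AutoTokenizer

# Compare the vocabulary sizes of the base and extended tokenizers
xlmr_tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
exlmr_tokenizer = AutoTokenizer.from_pretrained("Hailay/EXLMR")

print("XLM-RoBERTa vocab size:", len(xlmr_tokenizer))  # expected 250002
print("EXLMR vocab size:", len(exlmr_tokenizer))       # expected 280147
```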