---
base_model: UBC-NLP/AraT5v2-base-1024
tags:
- MSA
- Arabic Dialect
- Text-to-text
datasets:
- Murhaf/dialect_msa_silver_parallel
model-index:
- name: AraT5-MSAizer
  results: []
language:
- ar
metrics:
- bleu
---

# AraT5-MSAizer

This model is a fine-tuned version of [UBC-NLP/AraT5v2-base-1024](https://huggingface.co/UBC-NLP/AraT5v2-base-1024) for translating five regional Arabic dialects into Modern Standard Arabic (MSA).

## Intended uses & limitations

This model was developed for Task 2: Dialect to MSA Machine Translation at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools. It has only been evaluated on the development and test sets provided by the task organizers, so performance outside that setting is unverified. Minimal code sketches for inference, training configuration, and scoring are provided at the end of this card.

## Training and evaluation data

The model was fine-tuned on a blend of four distinct datasets. Three of them consist of 'gold' parallel MSA-dialect sentence pairs; the fourth, 'silver' dataset was generated by back-translating MSA into dialect.

**Gold parallel corpora**

- The Multi-Arabic Dialects Application and Resources corpus (MADAR)
- The North Levantine Corpus
- The Parallel Arabic DIalect Corpus (PADIC)

**Synthetic data**

- A back-translated subset of the Arabic sentences in [OPUS](https://huggingface.co/datasets/Helsinki-NLP/opus-100)

### Evaluation results

BLEU score on the development split of Task 2: Dialect to MSA Machine Translation at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools:

| Model         | BLEU   |
|---------------|--------|
| AraT5-MSAizer | 0.2302 |

Official evaluation results on the held-out test split:

| Model         | BLEU   | COMET-DA |
|---------------|--------|----------|
| AraT5-MSAizer | 0.2179 | 0.0016   |

## Training procedure

The model was trained by fully fine-tuning [UBC-NLP/AraT5v2-base-1024](https://huggingface.co/UBC-NLP/AraT5v2-base-1024) for a single epoch. The maximum input length is set to 1024 tokens (the same as in the original pre-trained model), while the maximum generation length is set to 512 tokens.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- warmup_ratio: 0.05
- num_epochs: 1

The full training script and configuration can be found at [https://github.com/Murhaf/AraT5-MSAizer](https://github.com/Murhaf/AraT5-MSAizer).

### Framework versions

- Transformers 4.38.1
- Pytorch 2.0.1
- Datasets 2.17.1
- Tokenizers 0.15.2
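
## Code sketches

### Inference

A minimal sketch of dialect-to-MSA translation with the `transformers` library. The model id `Murhaf/AraT5-MSAizer` and the beam-search setting are assumptions, not values confirmed by the training repository; the length limits follow the training procedure described above.

```python
# Minimal inference sketch. The model id below is an assumption based on the
# GitHub repository name; point it at wherever the fine-tuned checkpoint lives.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Murhaf/AraT5-MSAizer"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Levantine Arabic input: "What do you want to eat today?"
dialect_text = "شو بدك تاكل اليوم؟"

inputs = tokenizer(dialect_text, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_length=512, num_beams=5)  # num_beams is a guess
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```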
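
### Training configuration

An illustrative mapping of the hyperparameters listed above onto `Seq2SeqTrainingArguments`. This is a reconstruction for readability, not the exact configuration; the authoritative script is in the GitHub repository linked above.

```python
# Illustrative reconstruction of the listed hyperparameters as Hugging Face
# Seq2SeqTrainingArguments; the authoritative script is in the linked repo.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5-msaizer",        # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=1,
    predict_with_generate=True,        # generate during eval so BLEU can be computed
    generation_max_length=512,         # matches the stated generation limit
)
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the library defaults,
# so no optimizer arguments need to be overridden here.
```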
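
### BLEU scoring

An outline of how BLEU scores like those reported above can be computed with `sacrebleu`. The file names are placeholders; note that `sacrebleu` reports BLEU on a 0-100 scale, while the tables above appear to use a 0-1 scale.

```python
# Outline of corpus-level BLEU scoring with sacrebleu; the file names are
# placeholders, not artifacts shipped with this card.
import sacrebleu

with open("predictions.msa.txt", encoding="utf-8") as f:
    hypotheses = f.read().splitlines()
with open("references.msa.txt", encoding="utf-8") as f:
    references = f.read().splitlines()

# sacrebleu expects a list of reference streams (one list per reference set)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")  # reported on a 0-100 scale
```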