AraT5-MSAizer

This model is a fine-tuned version of UBC-NLP/AraT5v2-base-1024 for translating five regional Arabic dialects into Modern Standard Arabic (MSA).

Intended uses & limitations

This model was developed for participation in Task 2: Dialect to MSA Machine Translation of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools. It has been evaluated only on the development and test datasets provided by the task organizers.

Training and evaluation data

The model was fine-tuned on a blend of four datasets. Three of them consist of 'gold' parallel MSA-dialect sentence pairs; the fourth, considered 'silver', was generated by back-translating MSA into dialect.

Gold parallel corpora

  • The Multi-Arabic Dialects Application and Resources (MADAR)
  • The North Levantine Corpus
  • The Parallel Arabic DIalect Corpus (PADIC)

Synthetic data

  • A back-translated subset of the Arabic sentences in OPUS

Evaluation results

BLEU score on the development split of Task 2: Dialect to MSA Machine Translation under the 6th Workshop on Open-Source Arabic Corpora and Processing Tools.

Model            BLEU
AraT5-MSAizer    0.2302

Official evaluation results on the held-out test split

Model            BLEU      COMET DA
AraT5-MSAizer    0.2179    0.0016
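
For readers unfamiliar with the metric, here is a minimal sketch of corpus-level BLEU. It is an illustration only: the shared task's official scoring script is not described in this card, and the whitespace tokenization, single reference per sentence, and absence of smoothing are all assumptions of this sketch. The reported values appear to be on a 0-1 scale, which is the scale the function returns.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Textbook corpus-level BLEU in [0, 1].

    Assumptions (not taken from the task description): whitespace
    tokenization, exactly one reference per hypothesis, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Counter intersection clips each n-gram at its reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty applies when the hypotheses are not longer than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy English sentences, used only to exercise the function.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(f"BLEU = {score:.4f} (0-1 scale), {100 * score:.2f} (0-100 scale)")
```

To reproduce or compare scores in practice, an established implementation with its own tokenization is the safer choice; hand-rolled variants like this one can differ noticeably from it.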

Training procedure

The model was trained by fully fine-tuning UBC-NLP/AraT5v2-base-1024 for one epoch only. The maximum input length is set to 1024 (same as in the original pre-trained model) whereas the maximum generation length is set to 512.
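
The length limits above carry over to inference. The following is a minimal sketch of loading the model with the Transformers library and generating MSA output, assuming the checkpoint is published under the Hub id `Murhaf/AraT5-MSAizer` and loads through the generic seq2seq auto classes. The function name `translate_to_msa` is hypothetical, and any task prefix or input formatting used during fine-tuning is not stated in this card, so check the training script before relying on the raw-sentence input shown here.

```python
MODEL_ID = "Murhaf/AraT5-MSAizer"   # assumed Hub id for this model
MAX_INPUT_LENGTH = 1024             # maximum input length stated above
MAX_GENERATION_LENGTH = 512         # maximum generation length stated above


def translate_to_msa(sentences, model_id=MODEL_ID):
    """Translate a list of dialectal Arabic sentences into MSA (sketch).

    Hypothetical helper: it feeds the sentences to the model as-is, which
    is an assumption about the expected input format.
    """
    # Imported lazily so the constants above can be reused without
    # requiring transformers and torch to be installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Truncate inputs to the maximum length the model was trained with.
    inputs = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=MAX_INPUT_LENGTH,
    )
    output_ids = model.generate(**inputs, max_length=MAX_GENERATION_LENGTH)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `translate_to_msa([...])` with a list of dialectal sentences downloads the checkpoint on first use and returns one MSA string per input.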

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • warmup_ratio: 0.05
  • num_epochs: 1
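
To make the scheduler settings concrete, here is a small pure-Python sketch of a linear schedule with warmup using the learning rate and warmup ratio listed above. It mirrors the usual behaviour of such schedules (linear ramp from zero to the peak rate, then linear decay to zero) and is not copied from the training script. The function name, the rounding of the warmup step count, and the total step count in the example are assumptions, since the number of training steps is not given here.

```python
import math

LEARNING_RATE = 2e-5   # learning_rate from the list above
WARMUP_RATIO = 0.05    # warmup_ratio from the list above


def linear_schedule_lr(step, total_steps,
                       peak_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    # Assumption: the warmup step count is the ratio times the total, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from the peak learning rate down to 0 at the final step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical total step count, chosen only for illustration.
TOTAL_STEPS = 10_000
for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

With a single epoch, the whole schedule is traversed exactly once, so the learning rate reaches zero at the end of the only pass over the data.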

The full training script and configuration are available at https://github.com/Murhaf/AraT5-MSAizer

Framework versions

  • Transformers 4.38.1
  • PyTorch 2.0.1
  • Datasets 2.17.1
  • Tokenizers 0.15.2