---
base_model: UBC-NLP/AraT5v2-base-1024
tags:
- MSA
- Arabic Dialect
- Text-to-text
datasets:
- Murhaf/dialect_msa_silver_parallel
model-index:
- name: AraT5-MSAizer
  results: []
language:
- ar
metrics:
- bleu
---

# AraT5-MSAizer

This model is a fine-tuned version of [UBC-NLP/AraT5v2-base-1024](https://huggingface.co/UBC-NLP/AraT5v2-base-1024) for translating five regional Arabic dialects into Modern Standard Arabic (MSA).

## Intended uses & limitations

This model was developed for Task 2: Dialect to MSA Machine Translation at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools. It has only been evaluated on the development and test sets provided by the task organizers, so performance outside that setting is unverified. Minimal code sketches for inference, training configuration, and scoring are provided at the end of this card.

## Training and evaluation data

The model was fine-tuned on a blend of four distinct datasets. Three of them consist of 'gold' parallel MSA-dialect sentence pairs; the fourth, 'silver' dataset was generated by back-translating MSA into dialect.

**Gold parallel corpora**

- The Multi-Arabic Dialects Application and Resources corpus (MADAR)
- The North Levantine Corpus
- The Parallel Arabic DIalect Corpus (PADIC)

**Synthetic data**

- A back-translated subset of the Arabic sentences in [OPUS](https://huggingface.co/datasets/Helsinki-NLP/opus-100)

### Evaluation results

BLEU score on the development split of Task 2: Dialect to MSA Machine Translation at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools:

| Model         | BLEU   |
|---------------|--------|
| AraT5-MSAizer | 0.2302 |

Official evaluation results on the held-out test split:

| Model         | BLEU   | COMET-DA |
|---------------|--------|----------|
| AraT5-MSAizer | 0.2179 | 0.0016   |

## Training procedure

The model was trained by fully fine-tuning [UBC-NLP/AraT5v2-base-1024](https://huggingface.co/UBC-NLP/AraT5v2-base-1024) for a single epoch. The maximum input length is set to 1024 tokens (the same as in the original pre-trained model), while the maximum generation length is set to 512 tokens.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- warmup_ratio: 0.05
- num_epochs: 1

The full training script and configuration can be found at [https://github.com/Murhaf/AraT5-MSAizer](https://github.com/Murhaf/AraT5-MSAizer).

### Framework versions

- Transformers 4.38.1
- Pytorch 2.0.1
- Datasets 2.17.1
- Tokenizers 0.15.2
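
## Code sketches

### Inference

A minimal sketch of dialect-to-MSA translation with the `transformers` library. The model id `Murhaf/AraT5-MSAizer` and the beam-search setting are assumptions, not values confirmed by the training repository; the length limits follow the training procedure described above.

```python
# Minimal inference sketch. The model id below is an assumption based on the
# GitHub repository name; point it at wherever the fine-tuned checkpoint lives.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Murhaf/AraT5-MSAizer"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Levantine Arabic input: "What do you want to eat today?"
dialect_text = "شو بدك تاكل اليوم؟"

inputs = tokenizer(dialect_text, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_length=512, num_beams=5)  # num_beams is a guess
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```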
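
### Training configuration

An illustrative mapping of the hyperparameters listed above onto `Seq2SeqTrainingArguments`. This is a reconstruction for readability, not the exact configuration; the authoritative script is in the GitHub repository linked above.

```python
# Illustrative reconstruction of the listed hyperparameters as Hugging Face
# Seq2SeqTrainingArguments; the authoritative script is in the linked repo.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="arat5-msaizer",        # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=1,
    predict_with_generate=True,        # generate during eval so BLEU can be computed
    generation_max_length=512,         # matches the stated generation limit
)
# Adam betas=(0.9, 0.999) and epsilon=1e-08 are the library defaults,
# so no optimizer arguments need to be overridden here.
```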
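
### BLEU scoring

An outline of how BLEU scores like those reported above can be computed with `sacrebleu`. The file names are placeholders; note that `sacrebleu` reports BLEU on a 0-100 scale, while the tables above appear to use a 0-1 scale.

```python
# Outline of corpus-level BLEU scoring with sacrebleu; the file names are
# placeholders, not artifacts shipped with this card.
import sacrebleu

with open("predictions.msa.txt", encoding="utf-8") as f:
    hypotheses = f.read().splitlines()
with open("references.msa.txt", encoding="utf-8") as f:
    references = f.read().splitlines()

# sacrebleu expects a list of reference streams (one list per reference set)
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")  # reported on a 0-100 scale
```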