---
license: mit
datasets:
- opus_books
---

LlTRA stands for Language to Language Transformer: the Transformer model from the paper "Attention Is All You Need", built from scratch in PyTorch and used for translation.

---

Problem Statement:

In the rapidly evolving landscape of natural language processing (NLP) and machine translation, there exists a persistent challenge in achieving accurate and contextually rich language-to-language transformations. Existing models often struggle with capturing nuanced semantic meanings, context preservation, and maintaining grammatical coherence across different languages. Additionally, the demand for efficient cross-lingual communication and content generation has underscored the need for a versatile language transformer model that can seamlessly navigate the intricacies of diverse linguistic structures.

---

Goal:

Develop a specialized language-to-language transformer model that accurately translates from the Arabic language to the English language, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and the retention of grammar and style. The model should provide efficient training and inference processes to make it practical and accessible for a wide range of applications, ultimately contributing to the advancement of Arabic-to-English language translation capabilities.

---

Dataset used:

From Hugging Face: huggingface/opus_infopankki
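
As a rough illustration of how such a dataset could be loaded, the sketch below uses `load_dataset` from the Hugging Face `datasets` library. The helper names (`language_pair_config`, `load_translation_split`, `extract_pair`), the `"ar-en"` configuration name, and the `{"translation": {...}}` row layout are assumptions made for this example and are not taken from the project's code; check the dataset page before relying on them.

```python
def language_pair_config(source_language: str, target_language: str) -> str:
    # Assumption: the dataset configuration is named "<source>-<target>", e.g. "ar-en".
    return f"{source_language}-{target_language}"


def load_translation_split(datasource: str, source_language: str, target_language: str):
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset  # pip install datasets

    # Downloads the data on first use and returns the training split.
    return load_dataset(
        datasource,
        language_pair_config(source_language, target_language),
        split="train",
    )


def extract_pair(row: dict, source_language: str, target_language: str) -> tuple:
    # Assumption: each row looks like {"translation": {"ar": "...", "en": "..."}}.
    translation = row["translation"]
    return translation[source_language], translation[target_language]
```

Calling `load_translation_split("opus_infopankki", "ar", "en")` needs network access; `extract_pair` can then be applied to each row to obtain an (Arabic, English) sentence pair.
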

---

Configuration:

These are the model's settings. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.

```python
def Get_configuration():
    # Training and model settings; edit the values to change languages, sizes, or paths.
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,  # learning rate (1e-4)
        "sequence_length": 100,
        "d_model": 512,  # model (embedding) dimension
        "datasource": 'opus_infopankki',
        "source_language": "ar",
        "target_language": "en",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": "latest",
        "tokenizer_file": "tokenizer_{0}.json",  # {0} is a str.format placeholder
        "experiment_name": "runs/tmodel"
    }
```
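
To show how the path-related settings might fit together, here is a minimal sketch that builds a checkpoint path and a tokenizer file name from a configuration dictionary. The helpers `weights_file_path` and `tokenizer_path`, the `.pt` extension, and the use of the language code to fill `{0}` are assumptions for illustration only; the project's own code may combine these values differently.

```python
from pathlib import Path

# A subset of the settings above, copied here so the example is self-contained.
config = {
    "model_folder": "weights",
    "model_basename": "tmodel_",
    "tokenizer_file": "tokenizer_{0}.json",
    "source_language": "ar",
    "target_language": "en",
}


def weights_file_path(config: dict, epoch: str) -> str:
    # Hypothetical helper: "<model_folder>/<model_basename><epoch>.pt".
    # The ".pt" extension is an assumption, not something the settings specify.
    filename = f"{config['model_basename']}{epoch}.pt"
    return (Path(config["model_folder"]) / filename).as_posix()


def tokenizer_path(config: dict, language: str) -> str:
    # Hypothetical helper: fills the {0} placeholder with a language code.
    return config["tokenizer_file"].format(language)


print(weights_file_path(config, "05"))
print(tokenizer_path(config, config["source_language"]))
print(tokenizer_path(config, config["target_language"]))
```

Keeping the folder, base name, and file-name template in the configuration means a different language pair or output directory only requires changing the dictionary values.
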

---

Training:

I uploaded the project to Google Drive and connected it to Google Colab for training:

- Training time: 4 hours
- Epochs: 20
- Number of dataset rows: 2,934,399
- Size of the dataset: 95 MB
- Size of the auto-converted Parquet files: 153 MB
- Arabic tokens: 29,999
- English tokens: 15,697
- Pre-trained model in Colab
- BLEU score from Arabic to English: 19.7
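
For readers unfamiliar with the metric, the sketch below shows one simple way a corpus-level BLEU score can be computed: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes whitespace-tokenized input, a single reference per sentence, and no smoothing. It is not necessarily the implementation or tokenization behind the 19.7 reported above, so scores from it are not directly comparable.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    # Count every contiguous n-gram in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clip each n-gram's count by how often it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
print(corpus_bleu([reference], [reference]))
print(corpus_bleu(["the cat sat on the mat today".split()], [reference]))
```

An exact match scores 100, and any extra or missing words lower the score. For reporting results, an established tool with standardized tokenization is usually preferable to a hand-written scorer.
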

---