---
license: mit
datasets:
- opus_infopankki
---

LlTRA stands for Language-to-Language Transformer. It is a Transformer model, as described in the paper "Attention Is All You Need", built from scratch in PyTorch and used for translation.

---

Problem Statement:
In the rapidly evolving landscape of natural language processing (NLP) and machine translation, there exists a persistent challenge in achieving accurate and contextually rich language-to-language transformations. Existing models often struggle with capturing nuanced semantic meanings, context preservation, and maintaining grammatical coherence across different languages. Additionally, the demand for efficient cross-lingual communication and content generation has underscored the need for a versatile language transformer model that can seamlessly navigate the intricacies of diverse linguistic structures.

---

Goal:
Develop a specialized language-to-language transformer model that accurately translates from the Arabic language to the English language, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and the retention of grammar and style. The model should provide efficient training and inference processes to make it practical and accessible for a wide range of applications, ultimately contributing to the advancement of Arabic-to-English language translation capabilities.

---

Dataset used:
opus_infopankki, from Hugging Face (huggingface/opus_infopankki).
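
As a minimal sketch, the dataset might be loaded with the Hugging Face `datasets` library as shown below. The configuration name `"ar-en"`, the `train` split, and the helper names `load_translation_rows` and `extract_pair` are assumptions for illustration and are not taken from the project's code. The sketch also assumes each row holds a `translation` dictionary keyed by language code, which is the usual layout for OPUS-style datasets on the Hub.

```python
def load_translation_rows(datasource="opus_infopankki", src="ar", tgt="en"):
    """Download the parallel corpus (needs network access and the `datasets` package)."""
    # Imported lazily so that extract_pair below can be used offline.
    from datasets import load_dataset

    # Assumption: the language pair is exposed as a config named "<src>-<tgt>".
    return load_dataset(datasource, f"{src}-{tgt}", split="train")


def extract_pair(row, src="ar", tgt="en"):
    """Return (source_text, target_text) from one row.

    Assumes each row looks like {"translation": {"ar": "...", "en": "..."}}.
    """
    pair = row["translation"]
    return pair[src], pair[tgt]


# Illustrative row in the assumed layout (not taken from the dataset).
example_row = {"translation": {"ar": "مرحبا", "en": "Hello"}}
source_text, target_text = extract_pair(example_row)
```

Keeping the row-to-pair step in its own function makes it easy to swap the language codes if the source and target languages in the configuration change.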

---

Configuration:
These are the model's settings. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.

```python
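def Get_configuration():
    return {
        "batch_size": 8,                  # sentence pairs per training batch
        "num_epochs": 30,                 # passes over the training data
        "lr": 10**-4,                     # learning rate (1e-4)
        "sequence_length": 100,           # sequence length, in tokens
        "d_model": 512,                   # model (embedding) dimension
        "datasource": 'opus_infopankki',  # dataset name
        "source_language": "ar",          # Arabic
        "target_language": "en",          # English
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": "latest",
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel"
    }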
```
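
Because the settings are a plain dictionary, customizing a run amounts to overriding keys before training. The sketch below repeats the configuration so that it runs on its own. The helper `tokenizer_path` is hypothetical, and it assumes the `{0}` placeholder in `tokenizer_file` is meant to be filled with a language code.

```python
def Get_configuration():
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,
        "sequence_length": 100,
        "d_model": 512,
        "datasource": 'opus_infopankki',
        "source_language": "ar",
        "target_language": "en",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": "latest",
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel"
    }


def tokenizer_path(config, language):
    # Assumption: "{0}" is replaced by the language code, e.g. "ar" or "en".
    return config["tokenizer_file"].format(language)


config = Get_configuration()

# Override only the settings that differ from the defaults.
config["batch_size"] = 16
config["sequence_length"] = 128

source_tokenizer = tokenizer_path(config, config["source_language"])
target_tokenizer = tokenizer_path(config, config["target_language"])
print(source_tokenizer, target_tokenizer)  # tokenizer_ar.json tokenizer_en.json
```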

---

Training:
I uploaded the project to Google Drive and connected it to Google Colab for training:

- Training time: 4 hours.
- Epochs: 20.
- Number of dataset rows: 2,934,399.
- Size of the dataset: 95 MB.
- Size of the auto-converted Parquet files: 153 MB.
- Arabic tokens: 29,999.
- English tokens: 15,697.
- Pre-trained model in Colab.
- BLEU score from Arabic to English: 19.7.
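
For readers unfamiliar with the metric, here is a simplified sketch of how a BLEU score is computed: clipped n-gram precisions for n = 1 to 4 are combined by a geometric mean and multiplied by a brevity penalty. This is not the evaluation code behind the score reported above, which this card does not describe. The function names and the whitespace tokenization are assumptions, and established tools add smoothing and standardized tokenization that this sketch omits.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0 to 100), one reference per candidate."""
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # candidate n-grams, per n-gram order
    cand_len = ref_len = 0

    for cand, ref in zip(candidates, references):
        cand_tokens, ref_tokens = cand.split(), ref.split()
        cand_len += len(cand_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            cand_ngrams = ngram_counts(cand_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each count by its count in the reference.
            matched[n - 1] += sum((cand_ngrams & ref_ngrams).values())
            total[n - 1] += sum(cand_ngrams.values())

    if cand_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize candidates shorter than their references.
    brevity = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity * math.exp(log_precision)
```

With this sketch, a candidate identical to its reference scores 100, and a candidate that shares no words with its reference scores 0.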


---