Goal:

Develop a specialized language-to-language transformer model that accurately translates from Arabic to English, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and retention of grammar and style. The model should offer efficient training and inference, making it practical and accessible for a wide range of applications and ultimately advancing Arabic-to-English translation.

---

Dataset used:

opus_infopankki from Hugging Face (huggingface/opus_infopankki), used here for the Arabic-to-English (ar-en) pair.
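
For reference, here is a minimal sketch of how this corpus can be loaded with the Hugging Face `datasets` library. The dataset name and the `ar-en` configuration mirror the identifiers used in the configuration below, but the exact loading call is an assumption, not the project's own code.

```python
# Sketch (assumption): load the Arabic-English pair of opus_infopankki
# with the Hugging Face `datasets` library.
from datasets import load_dataset

raw_dataset = load_dataset("opus_infopankki", "ar-en", split="train")

# Each row is a translation pair: {"translation": {"ar": "...", "en": "..."}}
print(raw_dataset[0])
```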

---

Configuration:

These are the model settings. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.

```python
def Get_configuration():
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,                               # learning rate
        "sequence_length": 100,                     # maximum tokens per sequence
        "d_model": 512,                             # embedding / model dimension
        "datasource": 'opus_infopankki',
        "source_language": "ar",
        "target_language": "en",
        "model_folder": "weights",                  # folder where checkpoints are saved
        "model_basename": "tmodel_",                # checkpoint filename prefix
        "preload": "latest",                        # resume from the latest saved checkpoint
        "tokenizer_file": "tokenizer_{0}.json",     # per-language tokenizer file pattern
        "experiment_name": "runs/tmodel"            # experiment / log directory
    }
```
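
For illustration, a short sketch of how these settings might be consumed to resolve checkpoint and tokenizer paths. The helper `get_weights_file_path` and the `{datasource}_{model_folder}` directory layout are hypothetical, shown only to make the role of each key concrete.

```python
from pathlib import Path

def get_weights_file_path(config, epoch: str) -> str:
    # Hypothetical helper: builds a path such as "opus_infopankki_weights/tmodel_05.pt"
    folder = f"{config['datasource']}_{config['model_folder']}"
    filename = f"{config['model_basename']}{epoch}.pt"
    return str(Path(".") / folder / filename)

config = Get_configuration()
print(get_weights_file_path(config, "05"))                            # checkpoint path
print(config["tokenizer_file"].format(config["source_language"]))     # tokenizer_ar.json
```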

---

Training:

I uploaded the project to my Google Drive and connected it to Google Colab for training:

- Hours of training: 4 hours.
- Epochs: 20.
- Number of dataset rows: 2,934,399.
- Size of the dataset: 95 MB.
- Size of the auto-converted Parquet files: 153 MB.
- Arabic tokens: 29,999.
- English tokens: 15,697.
- Pre-trained model available in Colab.
- BLEU score from Arabic to English: 19.7 (see the evaluation sketch below).
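
For context on the score above, here is a hedged sketch of how a corpus-level BLEU score can be computed with the `sacrebleu` package. The package choice and the placeholder sentences are assumptions; the project may compute BLEU differently.

```python
# Sketch (assumption): corpus-level BLEU with sacrebleu.
import sacrebleu

# In practice these come from the validation loop: model outputs and reference translations.
hypotheses = ["the office is open from monday to friday"]
references = [["the office is open from monday to friday"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```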

---