Goal:

Develop a specialized language-to-language transformer model that accurately translates from Arabic to English, ensuring semantic fidelity, contextual awareness, cross-lingual adaptability, and retention of grammar and style. The model should offer efficient training and inference, making it practical and accessible for a wide range of applications and ultimately advancing Arabic-to-English translation.

---

Dataset used:

opus_infopankki from Hugging Face (huggingface/opus_infopankki), used here for the Arabic-to-English (ar-en) pair.
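
For reference, here is a minimal sketch of how this corpus can be loaded with the Hugging Face `datasets` library. The dataset name and the `ar-en` configuration mirror the identifiers used in the configuration below, but the exact loading call is an assumption, not the project's own code.

```python
# Sketch (assumption): load the Arabic-English pair of opus_infopankki
# with the Hugging Face `datasets` library.
from datasets import load_dataset

raw_dataset = load_dataset("opus_infopankki", "ar-en", split="train")

# Each row is a translation pair: {"translation": {"ar": "...", "en": "..."}}
print(raw_dataset[0])
```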

---

Configuration:

These are the model settings. You can customize the source and target languages, the sequence length, the number of epochs, the batch size, and more.

```python
def Get_configuration():
    return {
        "batch_size": 8,
        "num_epochs": 30,
        "lr": 10**-4,                               # learning rate
        "sequence_length": 100,                     # maximum tokens per sequence
        "d_model": 512,                             # embedding / model dimension
        "datasource": 'opus_infopankki',
        "source_language": "ar",
        "target_language": "en",
        "model_folder": "weights",                  # folder where checkpoints are saved
        "model_basename": "tmodel_",                # checkpoint filename prefix
        "preload": "latest",                        # resume from the latest saved checkpoint
        "tokenizer_file": "tokenizer_{0}.json",     # per-language tokenizer file pattern
        "experiment_name": "runs/tmodel"            # experiment / log directory
    }
```
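
For illustration, a short sketch of how these settings might be consumed to resolve checkpoint and tokenizer paths. The helper `get_weights_file_path` and the `{datasource}_{model_folder}` directory layout are hypothetical, shown only to make the role of each key concrete.

```python
from pathlib import Path

def get_weights_file_path(config, epoch: str) -> str:
    # Hypothetical helper: builds a path such as "opus_infopankki_weights/tmodel_05.pt"
    folder = f"{config['datasource']}_{config['model_folder']}"
    filename = f"{config['model_basename']}{epoch}.pt"
    return str(Path(".") / folder / filename)

config = Get_configuration()
print(get_weights_file_path(config, "05"))                            # checkpoint path
print(config["tokenizer_file"].format(config["source_language"]))     # tokenizer_ar.json
```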

---

Training:

I uploaded the project to my Google Drive and connected it to Google Colab for training:

- Hours of training: 4 hours.
- Epochs: 20.
- Number of dataset rows: 2,934,399.
- Size of the dataset: 95 MB.
- Size of the auto-converted Parquet files: 153 MB.
- Arabic tokens: 29,999.
- English tokens: 15,697.
- Pre-trained model available in Colab.
- BLEU score from Arabic to English: 19.7 (see the evaluation sketch below).
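
For context on the score above, here is a hedged sketch of how a corpus-level BLEU score can be computed with the `sacrebleu` package. The package choice and the placeholder sentences are assumptions; the project may compute BLEU differently.

```python
# Sketch (assumption): corpus-level BLEU with sacrebleu.
import sacrebleu

# In practice these come from the validation loop: model outputs and reference translations.
hypotheses = ["the office is open from monday to friday"]
references = [["the office is open from monday to friday"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```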

---