---
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
model-index:
- name: Terjman-Large-v2
  results: []
datasets:
- atlasia/darija_english
language:
- ar
---

# Terjman-Large-v2

This model is trained to translate English text (en) into Moroccan Darija (ary) written in Arabic script.

## Model Overview

Our model builds on the Transformer architecture, leveraging state-of-the-art natural language processing techniques.
It has been fine-tuned on the "atlasia/darija_english" dataset, enhanced with curated corpora to ensure high-quality and accurate translations.

## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-04
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 30

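For intuition, the schedule implied by `lr_scheduler_type: linear` with a 3% warmup ratio can be sketched in plain Python. This is an illustrative approximation of the shape of the schedule, not the exact Trainer internals:

```python
def linear_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to base_lr, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # warmup phase: ramp up from 0 to base_lr
        return base_lr * step / warmup_steps
    # decay phase: go linearly from base_lr down to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

The learning rate peaks at 2e-04 once warmup ends and reaches zero at the final step.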
## Framework versions

- Transformers 4.39.2
- Pytorch 2.2.2+cpu
- Datasets 2.18.0
- Tokenizers 0.15.2

## Usage

Using our model for translation is simple and straightforward.
You can integrate it into your projects or workflows via the Hugging Face Transformers library.
Here's a basic example of how to use the model in Python:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large-v2")

# Define your English text
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
```
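When translating many sentences, it is usually faster to group them into batches before tokenizing (the `padding=True` argument pads each batch to a common length). A minimal, hypothetical batching helper:

```python
def batch_texts(texts, batch_size=16):
    """Split a list of sentences into fixed-size batches for generation."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
```

Each batch can then be passed to `tokenizer(batch, return_tensors="pt", padding=True, truncation=True)` and `model.generate`, decoding the results with `tokenizer.batch_decode`.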

## Example

Let's see an example of translating English into Moroccan Darija:

**Input**: "Hello my friend, how's life in Morocco"

**Output**: "سالام صاحبي كيف الأحوال فالمغرب"

## Limitations

This version has some limitations, mainly due to the tokenizer.
We're currently collecting more data with the aim of continuous improvement.

## Feedback

We're continuously striving to improve our model's performance and usability, and we will be improving it incrementally.
If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.