atlasia
/

Terjman-Large-v2

BounharAbdelaziz committed on May 19

Commit

7120405

•

1 Parent(s): 205971a

Create README.md

Files changed (1)

README.md +89 -0

README.md ADDED

	@@ -0,0 +1,89 @@

+---
+license: cc-by-nc-4.0
+base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
+model-index:
+- name: Terjman-Large-v2
+  results: []
+datasets:
+- atlasia/darija_english
+language:
+- ar
+---
+# Terjman-Large-v2
+This model translates English text (en) into Moroccan Darija (ary) written in Arabic script.
+## Model Overview
+The model is built on the Transformer architecture and is fine-tuned from the base model Helsinki-NLP/opus-mt-tc-big-en-ar.
+It was fine-tuned on the "atlasia/darija_english" dataset, enhanced with curated corpora, with the aim of producing accurate translations.
+## Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 2e-04
+- train_batch_size: 16
+- eval_batch_size: 16
+- seed: 42
+- gradient_accumulation_steps: 4
+- total_train_batch_size: 32
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.03
+- num_epochs: 30
+## Framework versions
+- Transformers 4.39.2
+- PyTorch 2.2.2+cpu
+- Datasets 2.18.0
+- Tokenizers 0.15.2
+## Usage
+You can use the model for translation through the Hugging Face Transformers library.
+It can be integrated into your own projects or workflows.
+Here is a basic example of how to use the model in Python:
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large-v2")
+model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large-v2")
+# Define the English input text
+input_text = "Your English text goes here."
+# Tokenize the input text
+input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
+# Generate the Moroccan Darija translation
+output_tokens = model.generate(**input_tokens)
+# Decode the output tokens
+output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
+print("Translation:", output_text)
+```
+## Example
+Here is an example of translating English into Moroccan Darija:
+**Input**: "Hello my friend, how's life in Morocco"
+**Output**: "سالام صاحبي كيف الأحوال فالمغرب"
+## Limitations
+This version has some limitations, mainly due to the tokenizer.
+We are currently collecting more data with the aim of continuous improvement.
+## Feedback
+We are continuously working to improve the model's performance and usability, and will release improvements incrementally.
+If you have any feedback, suggestions, or encounter any issues, please don't hesitate to reach out to us.
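
Note on the training hyperparameters in the README above: `lr_scheduler_type: linear` with `lr_scheduler_warmup_ratio: 0.03` and `learning_rate: 2e-04` describes a learning rate that ramps up linearly during the first 3% of optimizer steps and then decays linearly to zero. The sketch below illustrates that shape in plain Python. It is not the training code used for this model: the function name `linear_schedule_lr`, the value of `TOTAL_STEPS`, and the choice to round the warmup step count up are assumptions made for illustration.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    # Assumption: the number of warmup steps is the ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical total number of optimizer steps; the real value depends on
# dataset size, batch size, gradient accumulation, and number of epochs.
TOTAL_STEPS = 1000

for step in (0, 10, 100, 500, 1000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```

The peak learning rate is reached at the end of warmup and equals `base_lr`; every later step uses a smaller value, reaching zero at the final step.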