---
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
model-index:
- name: Terjman-Large-v2
  results: []
datasets:
- atlasia/darija_english
language:
- ar
---

# Transliteration-Moroccan-Darija

This model is trained to translate English text (en) into Moroccan Darija text (Ary) written in Arabic letters.

## Model Overview

Our model is built on the Transformer architecture. It was fine-tuned from Helsinki-NLP/opus-mt-tc-big-en-ar on the "atlasia/darija_english" dataset, enhanced with curated corpora, with the aim of producing accurate translations.

## Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 2e-04
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 30

## Framework versions

- Transformers 4.39.2
- Pytorch 2.2.2+cpu
- Datasets 2.18.0
- Tokenizers 0.15.2

## Usage

You can integrate the model into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use the model in Python:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large-v2")

# Define your English input text
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
```

## Example

Let's see an example of translating English into Moroccan Darija written in Arabic letters:

**Input**: "Hello my friend, how's life in Morocco"

**Output**: "سالام صاحبي كيف الأحوال فالمغرب"

## Limitations

This version has some limitations, mainly due to the tokenizer. We're currently collecting more data with the aim of continuous improvement.

## Feedback

We're continuously striving to improve our model's performance and usability, and we will be improving it incrementally. If you have any feedback or suggestions, or encounter any issues, please don't hesitate to reach out to us.
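
## Learning-rate schedule sketch

To make the `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.03` entries under Training hyperparameters more concrete, here is a minimal sketch of a linear schedule with warmup, using the listed `learning_rate` of 2e-04. It is an illustration, not the training code used for this model. The function name `linear_warmup_lr` and the value of `TOTAL_STEPS` are hypothetical: the actual number of optimizer steps depends on the dataset size, which this card does not state. Rounding the warmup length up with `math.ceil` is also an assumption.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0."""
    # Assumption: the warmup length is the ratio times the total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale the base rate by the fraction of warmup completed.
        return base_lr * (step / max(1, warmup_steps))
    # Decay phase: scale the base rate by the fraction of post-warmup steps left.
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 1000  # hypothetical; the real value depends on the dataset size

# Print the learning rate at a few points of the (hypothetical) run.
for step in (0, 15, 30, 500, 1000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

With a warmup ratio of 0.03, only the first 3% of steps ramp the learning rate up; the remaining steps decay it linearly toward zero.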