Terjman-Large-v1.1 / README.md
metadata
license: cc-by-nc-4.0
base_model: Helsinki-NLP/opus-mt-tc-big-en-ar
metrics:
  - bleu
datasets:
  - atlasia/darija_english
model-index:
  - name: Terjman-Large
    results: []
language:
  - ar
  - en

Terjman-Large (240M params)

Terjman-Large is a Transformer-based translation model. It is a fine-tuned version of Helsinki-NLP/opus-mt-tc-big-en-ar, trained on the darija_english dataset enhanced with curated corpora, with the aim of producing accurate English-to-Moroccan-Darija translations.

It achieves the following results on the evaluation set:

  • Loss: 3.2078
  • Bleu: 8.3292
  • Gen Len: 34.4959
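Gen Len is not defined in this card. In the stock Transformers translation example script, it is the mean number of non-padding tokens across the generated predictions. Assuming the metrics here were computed the same way (an assumption, not something the card states), the calculation can be sketched as follows; the function name and the toy token ids are hypothetical.

```python
from typing import List


def gen_len(predictions: List[List[int]], pad_token_id: int) -> float:
    """Mean number of non-padding token ids per generated sequence.

    Assumption: mirrors how the stock Transformers translation example
    reports "Gen Len"; the model card does not confirm this.
    """
    lengths = [sum(1 for tok in pred if tok != pad_token_id) for pred in predictions]
    return sum(lengths) / len(lengths)


# Toy batch: two generated sequences, padded with id 0 (hypothetical ids).
toy_predictions = [[5, 6, 0, 0], [7, 8, 9, 0]]
mean_length = gen_len(toy_predictions, pad_token_id=0)
```

Under this reading, a Gen Len of 34.4959 means that generated translations averaged about 34 tokens on the evaluation set.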

Fine-tuning was conducted on an A100-40GB GPU and took 23 hours.

Try it out on our dedicated Terjman-Large Space 🤗

Usage

You can integrate the model into your projects or workflows via the Hugging Face Transformers library. Here's a basic example of how to use it in Python:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("atlasia/Terjman-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("atlasia/Terjman-Large")

# Define the English text to translate
input_text = "Your English text goes here."

# Tokenize the input text
input_tokens = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Perform translation
output_tokens = model.generate(**input_tokens)

# Decode the output tokens
output_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Translation:", output_text)
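
To translate several sentences at once, the same tokenizer and model calls accept a list of strings. The sketch below is one way to wrap this. The helper name translate_batch and the num_beams and max_new_tokens values are illustrative assumptions, not settings recommended by the authors.

```python
from typing import List


def translate_batch(
    texts: List[str],
    model_name: str = "atlasia/Terjman-Large",
    num_beams: int = 4,         # assumed value, not from the model card
    max_new_tokens: int = 128,  # assumed value, not from the model card
) -> List[str]:
    """Translate a list of English sentences with a single generate() call."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Pad to the longest sentence so the batch forms one tensor.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    output_tokens = model.generate(
        **batch, num_beams=num_beams, max_new_tokens=max_new_tokens
    )
    # Decode every sequence in the batch, dropping padding and end markers.
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)
```

Calling translate_batch(["Hello!", "How are you?"]) returns one translation per input, in the same order. This helper reloads the model on every call, so for repeated use it is better to load the tokenizer and model once and reuse them.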

Example

Let's see an example of translating English to Moroccan Darija:

Input: "Hi my friend, can you tell me a joke in moroccan darija? I'd be happy to hear that from you!"

Output: "مرحبا صديقي، يمكن لك تقول لي نكتة في داريجا المغربية؟ سأكون سعيدا بسماعها منك!"

Limitations

This version has some limitations, mainly due to the tokenizer. We're currently collecting more data with the aim of continuous improvement.

Feedback

We are working to improve the model's performance and usability incrementally. If you have feedback or suggestions, or run into any issues, please don't hesitate to reach out to us.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 22
  • eval_batch_size: 22
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 88
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 40
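
Two of the values above follow from the others. The total batch size is the per-device batch size times the gradient accumulation steps, and the linear scheduler's warmup length follows from the warmup ratio and the total step count. The sketch below reproduces both in plain Python. It assumes a single device, and takes the total of 16280 optimizer steps from the last row of the training results. The function names are hypothetical, and the schedule is a reimplementation of the usual linear-warmup, linear-decay shape, not code from the training run.

```python
import math

# Values copied from the hyperparameter list.
LEARNING_RATE = 3e-05
TRAIN_BATCH_SIZE = 22
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.03
# Assumption: total optimizer steps, read from the final training-results row.
TOTAL_STEPS = 16280


def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * accum_steps * num_devices


def linear_lr(step: int,
              base_lr: float = LEARNING_RATE,
              total_steps: int = TOTAL_STEPS,
              warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_batch = effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS)
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)
```

With these numbers, total_batch matches the reported total_train_batch_size of 88, and warmup covers the first 489 of the 16280 steps before the rate decays linearly to zero.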

Training results

| Training Loss | Epoch | Step | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|---|
| No log | 0.9982 | 407 | 4.3938 | 4.6056 | 22.6033 |
| 5.1616 | 1.9988 | 815 | 3.7257 | 5.8319 | 30.9201 |
| 3.902 | 2.9994 | 1223 | 3.5214 | 6.7311 | 32.9091 |
| 3.5737 | 4.0 | 1631 | 3.4204 | 7.3684 | 32.1433 |
| 3.4576 | 4.9982 | 2038 | 3.3562 | 7.8632 | 34.5399 |
| 3.4576 | 5.9988 | 2446 | 3.3151 | 7.9739 | 35.3278 |
| 3.3833 | 6.9994 | 2854 | 3.2884 | 8.0825 | 35.8292 |
| 3.3358 | 8.0 | 3262 | 3.2681 | 8.2765 | 34.5427 |
| 3.3069 | 8.9982 | 3669 | 3.2517 | 8.1019 | 33.584 |
| 3.2769 | 9.9988 | 4077 | 3.2404 | 8.106 | 33.3802 |
| 3.2769 | 10.9994 | 4485 | 3.2342 | 8.3037 | 33.303 |
| 3.2777 | 12.0 | 4893 | 3.2284 | 8.0674 | 33.3967 |
| 3.2476 | 12.9982 | 5300 | 3.2226 | 8.2883 | 33.8154 |
| 3.2611 | 13.9988 | 5708 | 3.2189 | 8.3537 | 34.0413 |
| 3.2511 | 14.9994 | 6116 | 3.2159 | 8.1365 | 34.5014 |
| 3.2437 | 16.0 | 6524 | 3.2140 | 8.3549 | 34.0606 |
| 3.2437 | 16.9982 | 6931 | 3.2131 | 8.2507 | 34.303 |
| 3.2498 | 17.9988 | 7339 | 3.2116 | 8.2928 | 33.9945 |
| 3.2341 | 18.9994 | 7747 | 3.2105 | 8.337 | 33.7052 |
| 3.2403 | 20.0 | 8155 | 3.2098 | 8.3179 | 34.3526 |
| 3.2229 | 20.9982 | 8562 | 3.2094 | 8.3848 | 34.2039 |
| 3.2229 | 21.9988 | 8970 | 3.2090 | 8.2042 | 34.6529 |
| 3.2379 | 22.9994 | 9378 | 3.2086 | 8.4227 | 34.0275 |
| 3.2257 | 24.0 | 9786 | 3.2082 | 8.3515 | 34.3306 |
| 3.2526 | 24.9982 | 10193 | 3.2085 | 8.4089 | 34.4986 |
| 3.2206 | 25.9988 | 10601 | 3.2082 | 8.476 | 34.6226 |
| 3.2288 | 26.9994 | 11009 | 3.2083 | 8.4452 | 33.697 |
| 3.2288 | 28.0 | 11417 | 3.2080 | 8.29 | 34.0331 |
| 3.2251 | 28.9982 | 11824 | 3.2080 | 8.35 | 34.2948 |
| 3.2302 | 29.9988 | 12232 | 3.2078 | 8.4408 | 33.416 |
| 3.21 | 30.9994 | 12640 | 3.2079 | 8.2934 | 34.0854 |
| 3.2271 | 32.0 | 13048 | 3.2079 | 8.4573 | 33.3912 |
| 3.2271 | 32.9982 | 13455 | 3.2078 | 8.4055 | 34.2452 |
| 3.2428 | 33.9988 | 13863 | 3.2079 | 8.5107 | 34.5152 |
| 3.2303 | 34.9994 | 14271 | 3.2080 | 8.3734 | 34.2562 |
| 3.2129 | 36.0 | 14679 | 3.2079 | 8.3193 | 34.4628 |
| 3.2119 | 36.9982 | 15086 | 3.2082 | 8.4122 | 34.2121 |
| 3.2119 | 37.9988 | 15494 | 3.2078 | 8.3585 | 33.8843 |
| 3.2445 | 38.9994 | 15902 | 3.2079 | 8.3968 | 34.6722 |
| 3.2356 | 39.9264 | 16280 | 3.2078 | 8.3292 | 34.4959 |
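
In these results, validation loss is essentially flat from about epoch 20 onward (3.2098 down to 3.2078), while BLEU keeps moving between roughly 8.2 and 8.5. The highest BLEU in the log is 8.5107 at step 13863, compared with 8.3292 for the final step. As a minimal sketch of how one might compare checkpoints by either metric, the snippet below uses a handful of rows copied from the results. The helper names are hypothetical, and nothing here implies the authors selected checkpoints this way.

```python
from typing import List, Tuple

# (epoch, step, validation_loss, bleu): a subset of rows copied from the results.
Row = Tuple[float, int, float, float]

ROWS: List[Row] = [
    (0.9982, 407, 4.3938, 4.6056),
    (9.9988, 4077, 3.2404, 8.106),
    (20.0, 8155, 3.2098, 8.3179),
    (33.9988, 13863, 3.2079, 8.5107),
    (39.9264, 16280, 3.2078, 8.3292),
]


def best_by_bleu(rows: List[Row]) -> Row:
    """Row with the highest BLEU score."""
    return max(rows, key=lambda row: row[3])


def best_by_loss(rows: List[Row]) -> Row:
    """Row with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])


bleu_pick = best_by_bleu(ROWS)
loss_pick = best_by_loss(ROWS)
print(f"Best BLEU: step {bleu_pick[1]} (BLEU {bleu_pick[3]})")
print(f"Lowest validation loss: step {loss_pick[1]} (loss {loss_pick[2]})")
```

The two criteria pick different steps here, which is worth keeping in mind when differences in validation loss are in the fourth decimal place.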

Framework versions

  • Transformers 4.40.2
  • PyTorch 2.2.1+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1