---
library_name: transformers
license: cc-by-nc-4.0
base_model: atlasia/Terjman-Large-v1.2
metrics:
- bleu
- chrf
- ter
model-index:
- name: Terjman-Large-v2.0
  results: []
datasets:
- BounharAbdelaziz/Terjman-v2-English-Darija-Dataset-350K
language:
- ary
- en
pipeline_tag: translation
---

# 🇲🇦 Terjman-Large-v2.0 (240M)

**Terjman-Large-v2.0** is an improved version of [atlasia/Terjman-Large-v1.2](https://huggingface.co/atlasia/Terjman-Large-v1.2), a Transformer-based model fine-tuned for **English-to-Moroccan-Darija translation**.

It was trained on a **larger and more refined dataset** than its predecessor, which improves translation quality. On [TerjamaBench](https://huggingface.co/datasets/atlasia/TerjamaBench), an evaluation benchmark for English-Moroccan Darija translation that puts extra weight on cultural aspects, the model approaches **gpt-4o-2024-08-06** (22.67 vs. 28.30 BLEU; see the comparison table below).

## Features

- ✅ **Fine-tuned for English->Moroccan Darija translation**.
- ✅ **Competitive among open-source models**: it outscores the much larger Atlas-Chat-9B and nllb-200-3.3B in the comparison below.
- ✅ **Compatible with 🤗 Transformers** and easily deployable on various hardware setups.

## Performance Comparison

The following table compares **Terjman-Large-v2.0** against proprietary and open-source models using BLEU, chrF, and TER scores. Higher **BLEU/chrF** and lower **TER** indicate better translation quality. Bold marks the best score in each group.

| **Model** | **Size** | **BLEU↑** | **chrF↑** | **TER↓** |
|------------|------|-------|-------|------|
| **Proprietary Models** | | | | |
| gemini-exp-1206 | * | **30.69** | **54.16** | 67.62 |
| claude-3-5-sonnet-20241022 | * | 30.51 | 51.80 | **67.42** |
| gpt-4o-2024-08-06 | * | 28.30 | 50.13 | 71.77 |
| **Open-Source Models** | | | | |
| Terjman-Ultra-v2.0 | 1.3B | **25.00** | **44.70** | **77.20** |
| Terjman-Supreme-v2.0 | 3.3B | 23.43 | 44.57 | 78.17 |
| **Terjman-Large-v2.0 (This model)** | 240M | 22.67 | 42.57 | 83.00 |
| Terjman-Nano-v2.0 | 77M | 18.84 | 38.41 | 94.73 |
| atlasia/Terjman-Large-v1.2.2 | 240M | 16.33 | 37.10 | 89.13 |
| MBZUAI-Paris/Atlas-Chat-9B | 9B | 14.80 | 35.26 | 93.95 |
| facebook/nllb-200-3.3B | 3.3B | 14.76 | 34.17 | 94.33 |
| atlasia/Terjman-Nano | 77M | 9.98 | 26.55 | 106.49 |
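
For intuition about what a chrF-style score measures, here is a minimal sketch of a character n-gram F-score in plain Python. It is a simplified illustration, not the tool that produced the scores above: the function names `char_ngrams` and `simple_chrf` are hypothetical, and reference implementations such as sacreBLEU may differ in details like smoothing and corpus-level aggregation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall, combined as F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this n-gram order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Toy example with made-up sentences (not benchmark data)
print(simple_chrf("the weather is nice", "the weather is so nice"))
```

Identical strings score 100 and strings with no characters in common score 0; partial matches fall in between, which is why chrF is more forgiving than BLEU toward small spelling variations.
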

## Model Details

- **Base Model**: [atlasia/Terjman-Large-v1.2](https://huggingface.co/atlasia/Terjman-Large-v1.2)
- **Architecture**: Transformer-based sequence-to-sequence model
- **Training Data**: High-quality parallel English-Darija corpus (`BounharAbdelaziz/Terjman-v2-English-Darija-Dataset-350K`)
- **Training Precision**: Mixed FP16

## How to Use

You can use the model with the **Hugging Face Transformers** library:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = "BounharAbdelaziz/Terjman-Large-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


def translate(text):
    """Translate English text into Moroccan Darija."""
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(**inputs)
    return tokenizer.decode(output[0], skip_special_tokens=True)


# Example translation
text = "Hello there! Today the weather is so nice in Geneva, couldn't ask for more to enjoy the holidays :)"
translation = translate(text)
print("Translation:", translation)
# prints: صباح الخير! اليوم الطقس زوين بزاف فجنيف، ما قدرتش نطلب المزيد باش نستمتعو بالعطل:)
```
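
## Deployment

### Run in a Hugging Face Space
Try the model interactively in the [Terjman-Large Space](https://huggingface.co/spaces/BounharAbdelaziz/Terjman-Large-v2.0) 🤗

### Use with Text Generation Inference (TGI)
For fast inference, use **Hugging Face TGI**. Note that `pip install text-generation` installs only the Python client; the `text-generation-launcher` server has to be installed separately (for example through the official TGI Docker image or a source build).

```bash
# Python client for querying a running TGI server
pip install text-generation
# Start the server (requires a separate TGI installation)
text-generation-launcher --model-id BounharAbdelaziz/Terjman-Large-v2.0
```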

### Run Locally with Transformers & PyTorch
```bash
pip install transformers torch
python -c "from transformers import pipeline; print(pipeline('translation', model='BounharAbdelaziz/Terjman-Large-v2.0')('Hello there!'))"
```

### Deploy on an API Server
Use **FastAPI** to serve translations as an API:

```python
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

app = FastAPI()

# Load the tokenizer and model once at startup so every request reuses them
model_name = "BounharAbdelaziz/Terjman-Large-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


@app.get("/translate/")
def translate(text: str):
    """Translate the `text` query parameter and return the result as JSON."""
    inputs = tokenizer(text, return_tensors="pt")
    output = model.generate(**inputs)
    return {"translation": tokenizer.decode(output[0], skip_special_tokens=True)}
```
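
A client for the endpoint above could look like the following sketch, which uses only the Python standard library. It assumes the app is served locally at `http://localhost:8000` (for example with `uvicorn`); that address and the helper names `build_translate_url` and `translate_remote` are illustrative assumptions, not part of the model card's original code.

```python
import json
import urllib.parse
import urllib.request

# Assumed server address; adjust to wherever the FastAPI app is running.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_translate_url(text: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the GET URL for /translate/, percent-encoding the text."""
    query = urllib.parse.urlencode({"text": text})
    return f"{base_url}/translate/?{query}"


def translate_remote(text: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Call the running server and return the "translation" field of its JSON reply."""
    with urllib.request.urlopen(build_translate_url(text, base_url)) as response:
        return json.load(response)["translation"]


# Building the URL needs no server, so it can be inspected offline
print(build_translate_url("Hello there!"))
```

Encoding the query with `urlencode` matters here because input sentences routinely contain spaces, punctuation, and non-ASCII characters that are not valid in a raw URL.
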

## Training Details

### Hyperparameters

The model was fine-tuned using the following training settings:

- **Learning Rate**: `0.001`
- **Training Batch Size**: `16`
- **Evaluation Batch Size**: `16`
- **Seed**: `42`
- **Gradient Accumulation Steps**: `8`
- **Total Effective Batch Size**: `128` (16 × 8)
- **Optimizer**: `AdamW (Torch)` with `betas=(0.9, 0.999)`, `epsilon=1e-08`
- **Learning Rate Scheduler**: `Linear`
- **Warmup Ratio**: `0.1`
- **Epochs**: `2`
- **Precision**: `Mixed FP16` for efficient training
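
The schedule implied by these settings can be sketched in plain Python. This is an illustration under stated assumptions, not the training script: the dataset size of 350,000 pairs is inferred from the dataset name, a single device is assumed, and the helper `linear_schedule_lr` is a hypothetical name. The step counts used by the actual training run may differ.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


# Values taken from the hyperparameter list above
per_device_batch = 16
grad_accum_steps = 8
peak_lr = 0.001
warmup_ratio = 0.1
epochs = 2

# Assumption: roughly 350K training pairs, as suggested by the dataset name
num_examples = 350_000

effective_batch = per_device_batch * grad_accum_steps  # 16 * 8 = 128
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(warmup_ratio * total_steps)

# Print the learning rate at the start, end of warmup, midpoint, and end
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps, peak_lr))
```

With a warmup ratio of `0.1`, the learning rate climbs to its peak over the first tenth of the optimizer steps and then decays linearly to zero by the last step.
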

## Framework versions

- Transformers 4.47.1
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.21.0

## License

This model is released under the **CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial)** license: it may be used for research and personal projects but not for commercial purposes. For commercial use, please get in touch :)

```bibtex
@misc{terjman-v2,
  title = {Terjman-v2: High-Quality English-Moroccan Darija Translation Model},
  author = {Abdelaziz Bounhar},
  year = {2025},
  howpublished = {\url{https://huggingface.co/BounharAbdelaziz/Terjman-Large-v2.0}},
  license = {CC BY-NC}
}
```