---
library_name: transformers
license: cc-by-nc-4.0
language:
- de
- frr
base_model: facebook/nllb-200-distilled-600M
widget:
- text: "Momme booget önj Naibel."
  example_title: "Example with names"
- text: "Et wus mån en däiken stroote ful foon däike manschne."
  example_title: "Longer example"
---

# Model Card for nllb-deu-moo-v2

This is an [NLLB-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model fine-tuned for translating between German and the Northern Frisian dialect Mooring, following [this great blog post](https://cointegrated.medium.com/a37fc706b865).

## Limitations

This model should be considered no more than a demo.
The dataset used for fine-tuning is relatively small and was constructed from multiple texts by a single author.
On top of that, the texts are relatively old and are set in the 19th century and earlier.
As a result, the Frisian vocabulary the model has learned is highly limited, especially when it comes to more modern words and phrases.

In a separate issue, while the model can translate from German to Frisian and from Frisian to any language supported by the base model, it cannot translate from any language other than German into Frisian: the result will be in German instead. The reason for this is not yet known.

## Model Details

### Model Description

- **Language(s) (NLP):** Northern Frisian, German
- **License:** Creative Commons Attribution Non-Commercial 4.0 (CC BY-NC 4.0)
- **Finetuned from model:** NLLB-200-600M

## How to Get Started with the Model

How to use the model:

```python
# Requires transformers >= 4.38 (in a notebook: !pip install transformers>=4.38)
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("CmdCody/nllb-deu-moo-v2")
model = AutoModelForSeq2SeqLM.from_pretrained("CmdCody/nllb-deu-moo-v2")
model.cuda()

def translate(text, tokenizer, model, src_lang='frr_Latn', tgt_lang='deu_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Generation length budget: a + b tokens per input token
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Momme booget önj Naibel.", tokenizer=tokenizer, model=model)
```

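To translate in the other direction, swap the language codes (as noted in the limitations above, only German works as a source language when translating into Frisian). The German input below is purely illustrative:

```python
# German -> Mooring Frisian: swap src_lang and tgt_lang (illustrative input sentence)
translate("Der Himmel über Niebüll ist heute blau.",
          tokenizer=tokenizer, model=model,
          src_lang='deu_Latn', tgt_lang='frr_Latn')
```
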
## Training Details

### Training Data

The training data consists of ["Rüm Hart"](https://www.nordfriiskfutuur.eu/fileadmin/Content/Nordfriisk_Futuur/E-Books/N._A._Johannsen__Ruem_hart.pdf), published by the Nordfriisk Instituut. It was split and cleaned up, partially manually, resulting in 5178 example sentences.

### Training Procedure

The training loop was implemented as described in [this article](https://cointegrated.medium.com/a37fc706b865).
The model was trained for 5 epochs of 1000 steps each with a batch size of 16 on a Google GPU via a Colab notebook.
Each epoch took roughly 30 minutes to train.

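For orientation, here is a minimal sketch of what such a fine-tuning loop can look like. The optimizer settings, sequence length, and batch sampling are illustrative assumptions based on the linked article, not the exact configuration used for this model:

```python
# Illustrative sketch only: hyperparameters and helpers are assumptions, not the exact setup used.
# It also assumes the language code 'frr_Latn' has already been added to the tokenizer and the
# model's embeddings, as described in the linked article.
import random
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
from transformers.optimization import Adafactor

tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M").cuda()
optimizer = Adafactor(model.parameters(), lr=1e-4, scale_parameter=False, relative_step=False, clip_threshold=1.0)

def train(pairs, epochs=5, steps_per_epoch=1000, batch_size=16, max_length=128):
    """pairs: list of (german_sentence, frisian_sentence) tuples from the cleaned corpus."""
    model.train()
    for epoch in range(epochs):
        for step in range(steps_per_epoch):
            batch = random.sample(pairs, batch_size)
            de_texts = [de for de, frr in batch]
            frr_texts = [frr for de, frr in batch]
            # Tokenize source (German) and target (Mooring) with the matching language codes
            tokenizer.src_lang = 'deu_Latn'
            inputs = tokenizer(de_texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
            tokenizer.src_lang = 'frr_Latn'
            labels = tokenizer(frr_texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
            # Padding tokens should not contribute to the loss
            labels.input_ids[labels.input_ids == tokenizer.pad_token_id] = -100
            loss = model(**inputs, labels=labels.input_ids).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```
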
The BLEU scores below were calculated on a held-out set of 177 sentences taken from other sources.

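For reference, such scores can be computed with, for example, sacreBLEU. A minimal sketch, where the test-set loading and helper names are assumptions:

```python
# Illustrative BLEU evaluation sketch using sacrebleu (pip install sacrebleu).
# `test_pairs` is assumed to be the list of 177 held-out (Frisian, German) sentence pairs.
import sacrebleu

def evaluate_bleu(test_pairs, tokenizer, model):
    frr_sentences = [frr for frr, de in test_pairs]
    de_references = [de for frr, de in test_pairs]
    # Translate Frisian -> German with the `translate` helper defined above
    de_hypotheses = [translate(s, tokenizer=tokenizer, model=model)[0] for s in frr_sentences]
    return sacrebleu.corpus_bleu(de_hypotheses, [de_references]).score
```
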
#### Metrics

| Epochs | Steps | BLEU Score frr -> de | BLEU Score de -> frr |
|--------|-------|----------------------|----------------------|
| 1      | 1000  | 35.86                | 35.68                |
| 2      | 2000  | 40.76                | 42.25                |
| 3      | 3000  | 42.18                | 46.48                |
| 4      | 4000  | 41.01                | 45.15                |
| 5      | 5000  | 44.74                | 47.48                |