license: cc-by-nc-4.0
language:
- de
- frr
pipeline_tag: translation
base_model: facebook/nllb-200-distilled-600M
inference: false
Northern Frisian translation model
This is an NLLB-200-600M model fine-tuned for translating between German and the Northern Frisian dialect Mooring following this great blogpost.
Data
The dataset for finetuning consisted of 7194 sentence pairs of the Ååstermooring dialect of North Frisian with German translation. Most examples (roughly 5100) were taken directly from "Rüm Hart" published by the Nordfriisk Instituut. For sentence splitting the python sentence-splitting library was used. The splitting wasn't perfect, especially in cases of direct speech, so that manual re-alignment and further splitting was necessary. A further roughly 2000 examples were taken from the Frasch Uurdebök, Friesisches Wörterbuch, Neumünster 1988. Finally, a little under 180 very simple self-written examples were used as evaluation data set.
Usage
How to use the model:
!pip install transformers==4.33
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
def create_tokenizer_with_new_lang(model_id, new_lang):
tokenizer = NllbTokenizer.from_pretrained(model_id)
old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
tokenizer.lang_code_to_id[new_lang] = old_len-1
tokenizer.id_to_lang_code[old_len-1] = new_lang
# always move "mask" to the last position
tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
if new_lang not in tokenizer._additional_special_tokens:
tokenizer._additional_special_tokens.append(new_lang)
# clear the added token encoder; otherwise a new token may end up there by mistake
tokenizer.added_tokens_encoder = {}
return tokenizer
def translate(
text,
tokenizer,
model,
src_lang='frr_Latn',
tgt_lang='deu_Latn',
a=32,
b=3,
max_input_length=1024,
num_beams=4,
**kwargs
):
tokenizer.src_lang = src_lang
tokenizer.tgt_lang = tgt_lang
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
result = model.generate(
**inputs.to(model.device),
forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
num_beams=num_beams,
**kwargs
)
return tokenizer.batch_decode(result, skip_special_tokens=True)
path = "CmdCody/nllb-deu-moo"
tokenizer = create_tokenizer_with_new_lang(path, 'frr_Latn')
model = AutoModelForSeq2SeqLM.from_pretrained(path)
translate("Momme booget önj Naibel", tokenizer=tokenizer, model=model)
Training
The model was trained in a Google Colab notebook for 5000 steps and a batch size of 16 following the above mentioned blog post.
Metrics on the evaluation data set:
Bleu | ChrF++ | |
---|---|---|
Frr -> De | 48.79 | 65.12 |
De -> Frr | 47.56 | 65.03 |