license: mit
SMALL-100 Model
SMaLL-100 is a compact and fast massively multilingual machine translation model covering more than 10K language pairs. It achieves results competitive with M2M-100 while being much smaller and faster. It was introduced in this paper (accepted to EMNLP 2022) and initially released in this repository.
The model architecture and config are the same as the M2M-100 implementation, but the tokenizer is modified to adjust language codes. For the moment, you should therefore load the tokenizer locally from the tokenization_small100.py file.
Demo: https://huggingface.co/spaces/alirezamsh/small100
Note: SMALL100Tokenizer requires sentencepiece, so make sure to install it with:
pip install sentencepiece
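Before loading the model, it can help to confirm that both prerequisites are importable. The snippet below is only a sketch: `missing_modules` is a hypothetical helper (not part of any library), and it assumes tokenization_small100.py has been placed in the working directory or elsewhere on the Python path.

```python
from importlib.util import find_spec
from typing import List


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the Python path."""
    return [name for name in names if find_spec(name) is None]


# Both names are taken from the instructions above; the helper itself is illustrative.
missing = missing_modules(["sentencepiece", "tokenization_small100"])
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found.")
```

If sentencepiece is reported as missing, install it with the pip command above; if tokenization_small100 is missing, copy the file next to your script or notebook.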
- Supervised Training
SMaLL-100 is a seq-to-seq model for the translation task. The input to the model is source: [tgt_lang_code] + src_tokens + [EOS] and target: tgt_tokens + [EOS].
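The source and target layouts described above can be illustrated with plain Python lists. This is a minimal sketch, not the tokenizer's actual implementation: `build_source` and `build_target` are hypothetical helpers, and the token strings (`__en__`, `</s>`, and the subword pieces) are placeholders assumed for illustration.

```python
from typing import List

EOS = "</s>"  # assumed end-of-sequence token string, for illustration only


def build_source(tgt_lang_code: str, src_tokens: List[str]) -> List[str]:
    """Source side: [tgt_lang_code] + src_tokens + [EOS]."""
    return [tgt_lang_code] + src_tokens + [EOS]


def build_target(tgt_tokens: List[str]) -> List[str]:
    """Target side: tgt_tokens + [EOS]."""
    return tgt_tokens + [EOS]


# Placeholder tokens; a real tokenizer would produce its own subword pieces.
source = build_source("__en__", ["▁สวัสดี"])
target = build_target(["▁Hello"])
print(source)
print(target)
```

Note that the target language code is attached to the source sequence, while the target sequence carries no language code.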
small-100-th is the fine-tuned version of SMaLL-100 for Thai.
The dataset can be acquired from scb-mt-en-th-2020 and OPUS. It can also be downloaded directly from Vistec.
small-100-th inference
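from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer  # local file, see above
from huggingface_hub import notebook_login

# Log in to the Hugging Face Hub from a notebook
# (only needed if the checkpoint requires authentication).
notebook_login()

# Load the fine-tuned model and its tokenizer.
checkpoint = "kimmchii/small-100-th"
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)
tokenizer = SMALL100Tokenizer.from_pretrained(checkpoint)

thai_text = "สวัสดี"

# Translate Thai to English: set the target language before encoding.
tokenizer.tgt_lang = "en"
encoded_th = tokenizer(thai_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_th)

# batch_decode returns a list with one string per input sequence.
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# => ["Hello"]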