Helsinki-NLP/opus-mt-en-grk · this model doesn't work

lawless-m

Jan 17, 2024

even on the example page

My name is Sarah and I live in London

comes out as

Λέ με λένε Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά Σά και μέ μέ μέ μέ

ArthurZ

Language Technology Research Group at the University of Helsinki org Jan 26, 2024

Indeed, slipped through the cracks it seems! Will push something

lysandre

Feb 15, 2024

edited Feb 15, 2024

Hey @lawless-m, sorry for the delay, but the model does work! See below:

As written in the README:

a sentence initial language token is required in the form of >>id<< (id = valid target language ID)

You can list the target language IDs that any Helsinki-NLP model supports with:

>>> from transformers import MarianTokenizer
>>> model_name = "Helsinki-NLP/opus-mt-en-grk"
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
>>> print(tokenizer.supported_language_codes)
['>>ell<<']

I tested it on newer versions of transformers as well, and it works well! See the following snippet:

from transformers import MarianMTModel, MarianTokenizer

# Each source sentence must start with the target language token (>>ell<< here).
src_text = [
    ">>ell<< Yesterday was my birthday"
]

model_name = "Helsinki-NLP/opus-mt-en-grk"
tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)  # valid target language tokens

model = MarianMTModel.from_pretrained(model_name)
# Tokenize the batch, generate the translations, then decode them back to text.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
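The sentence from the original report had no >>ell<< prefix, which is presumably why the output degenerated into repeated syllables. If you don't want to add the token by hand, a small helper can do it for you. This is only a sketch: add_language_token is a hypothetical name, not part of transformers, and the supported list is hard-coded here from the tokenizer output quoted above so the snippet runs without downloading the model.

```python
from typing import Iterable, List


def add_language_token(
    texts: Iterable[str], lang_id: str, supported_codes: Iterable[str]
) -> List[str]:
    """Prefix each sentence with >>lang_id<< unless it is already there.

    Hypothetical helper, not part of transformers.
    """
    token = f">>{lang_id}<<"
    supported = list(supported_codes)
    if token not in supported:
        raise ValueError(f"{token} is not supported; choose one of {supported}")
    # Leave sentences that already carry the token untouched.
    return [t if t.startswith(token) else f"{token} {t}" for t in texts]


# Assumption: hard-coded from the output quoted in this thread. In real use,
# pass tokenizer.supported_language_codes instead.
supported = [">>ell<<"]

prepared = add_language_token(
    ["My name is Sarah and I live in London"], "ell", supported
)
print(prepared)  # ['>>ell<< My name is Sarah and I live in London']
```

The prepared list can then be passed to the tokenizer in place of src_text in the snippet above.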