Regarding tokenizer.json file
Hi
I am trying to generate a single tokenizer.json or tokenizer.model file from the source.spm and target.spm files. I saw a tokenizer.json in this repo, but I am still unable to tokenize without the .spm files. Could you please tell me how you generated the single tokenizer.json, and how to use it for tokenization?
It's a bit hacky and hard-coded, but the details can be found here: https://github.com/xenova/transformers.js/blob/5ac17bda838547b9167a75ba1eb1b2e98b680cab/scripts/extra/marian.py
Hi. Thanks for the response. I went through the code you pointed to. I need to perform translation using the Helsinki-NLP/opus-mt-it-en model. In my work, the inference pipeline expects a tokenizer.json, and I can't tokenize using the .spm files, so I am trying to tokenize with a tokenizer.json file, which I generated using your method.
For example, if I take any of your models from Xenova/opus-mt and pass the model path to MarianTokenizer or AutoTokenizer, without the .spm files in that directory, the tokenizer fails to load from the directory.
Query: Is there some way to use a Hugging Face model and tokenize with tokenizer.json rather than with the .spm files?
Please let me know if the question is clear to you.
Note: I understand that AutoTokenizer and MarianTokenizer are coded in a way that they expect .spm files. Currently I am looking for a solution that uses tokenizer.json in some other way, without any need to change the Auto/Marian tokenizers.