Regarding tokenizer.json file
Hi
I am trying to generate a single tokenizer.json or tokenizer.model file from the source.spm and target.spm files. I saw a tokenizer.json in this repo, but I am still unable to tokenize without the .spm files. Could you please tell me how you generated the single tokenizer.json, and how to use it for tokenization?
It's a bit hacky and hard-coded, but the details can be found here: https://github.com/xenova/transformers.js/blob/5ac17bda838547b9167a75ba1eb1b2e98b680cab/scripts/extra/marian.py
Hi. Thanks for the response. I went through the code you pointed to. I need to perform translation using the Helsinki-NLP/opus-mt-it-en model. In my work, the inference pipeline expects a tokenizer.json, and I can't tokenize using the .spm files, so I am trying to tokenize with a tokenizer.json file, which I generated using your method.
For example, if I take any of your models from Xenova/opus-mt and pass the model path to MarianTokenizer or AutoTokenizer, without the .spm files in that directory, the tokenizer fails to load from the directory.
Query: Is there some way to use a Hugging Face model and tokenize with tokenizer.json rather than with the .spm files?
Please let me know if the question is clear to you.
Note: I understand that AutoTokenizer and MarianTokenizer are coded in a way that they expect .spm files. Currently I am looking for a solution that uses tokenizer.json in some other way, without any need to change the Auto/Marian tokenizers.