Why doesn't 'microsoft/trocr-small-printed' have a vocab.json?
Hello, thank you for your great work. I have a question: why doesn't 'microsoft/trocr-small-printed' have a vocab.json? Where is it?
Hey yaop, I had the same problem. After checking the issue I installed the SentencePiece library:
pip install sentencepiece
and the problem disappeared. I'm guessing the *sentencepiece.bpe.model* file serves as the vocabulary.
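As a quick sanity check (just a sketch on my side, not from the model card), you can confirm the vocabulary is picked up from that file once sentencepiece is installed:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> tokenizer.vocab_size  # vocabulary comes from sentencepiece.bpe.model, no vocab.json needed
>>> tokenizer.tokenize("hello world")  # confirms tokenization works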
It looks like the TrOCR authors used a different tokenization algorithm for the small variants (SentencePiece instead of Byte Pair Encoding).
Hence, you indeed need the SentencePiece library. You can load the tokenizer as follows:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-printed")
>>> type(tokenizer)
<class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>
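As a follow-up (a minimal sketch, with the image path as a placeholder), the tokenizer also comes bundled in the TrOCRProcessor, so for inference you usually don't need to load it separately:
>>> from PIL import Image
>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-small-printed")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-small-printed")
>>> image = Image.open("printed_line.png").convert("RGB")  # placeholder path to a printed text-line image
>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> processor.batch_decode(generated_ids, skip_special_tokens=True)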
Thank you, it really helped me.
Thanks a lot, I will try it.