How to use this model for tokenization?

#1
by tprochenka - opened

Hi, I tried to load the tokenizer like this:
tokenizer = LongformerTokenizer.from_pretrained("sdadas/polish-longformer-base-4096")
I got an error saying that vocab_file was not found. Indeed, the repository has no vocab.json; I only see tokenizer.json. Could you please share a snippet showing how to tokenize text with your model?

Thanks!

Hi, the model ships only the fast tokenizer format (tokenizer.json). Use LongformerTokenizerFast instead of LongformerTokenizer:

from transformers import LongformerTokenizerFast
tokenizer = LongformerTokenizerFast.from_pretrained("sdadas/polish-longformer-base-4096")
encoded = tokenizer("Zażółcić gęślą jaźń.")
print(encoded.input_ids)
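
To spell out the reasoning: the slow LongformerTokenizer reads vocab.json and merges.txt, while the fast class loads everything from a single tokenizer.json. Below is a rough sketch of that decision as a plain Python helper. The function name pick_tokenizer_class and the file listing are hypothetical, made up for illustration; this is not part of transformers, and it simplifies how the library actually resolves files.

```python
# Hypothetical helper (not a transformers API): given the file names in a
# model repository, suggest which Longformer tokenizer class to try.
SLOW_FILES = {"vocab.json", "merges.txt"}  # files the slow BPE tokenizer reads
FAST_FILE = "tokenizer.json"               # single file the fast tokenizer reads


def pick_tokenizer_class(filenames):
    """Return the name of the tokenizer class to try, or None if neither fits."""
    names = set(filenames)
    if FAST_FILE in names:
        # tokenizer.json is present, so the fast class can load directly.
        return "LongformerTokenizerFast"
    if SLOW_FILES <= names:
        # Only the slow-format files are present.
        return "LongformerTokenizer"
    return None


# Assumed file listing for illustration: no vocab.json, only tokenizer.json.
repo_files = ["config.json", "pytorch_model.bin", "tokenizer.json"]
print(pick_tokenizer_class(repo_files))  # prints "LongformerTokenizerFast"
```

In practice you rarely need such a check: AutoTokenizer.from_pretrained defaults to use_fast=True, so it should make the same choice for this repository.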

Thanks for the quick answer, it works :)

tprochenka changed discussion status to closed
