How to use this model for tokenization?

by tprochenka - opened Jun 14, 2022

Jun 14, 2022

Hi, I tried to load the tokenizer:
tokenizer = LongformerTokenizer.from_pretrained("sdadas/polish-longformer-base-4096")
I got an error that vocab_file is not found. Indeed, I see that there is no vocab.json in the repository; instead there is a tokenizer.json. Could you please share a snippet showing how to do tokenization using your model?

Thanks!

sdadas

Owner Jun 14, 2022

Hi, the model supports only the fast tokenizer format, which is why the repository contains tokenizer.json but no vocab.json. Use LongformerTokenizerFast instead of LongformerTokenizer:

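from transformers import LongformerTokenizerFast
# Load the fast tokenizer, which reads tokenizer.json from the model repository
tokenizer = LongformerTokenizerFast.from_pretrained("sdadas/polish-longformer-base-4096")
# Encode a sample Polish sentence and print the resulting token ids
encoded = tokenizer("Zażółcić gęślą jaźń.")
print(encoded.input_ids)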
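
If it helps to see why the slow class fails here, the rule can be sketched as a small check on the files in a repository. This is only an illustration: pick_tokenizer_class is a hypothetical helper, not part of transformers, and the file list in the usage line is an assumed example rather than the exact contents of this repository. It relies on the slow LongformerTokenizer needing vocab.json and merges.txt, while the fast one can load from tokenizer.json alone.

```python
# Files the slow (pure Python) Longformer tokenizer needs, and the single
# file the fast tokenizer can load from.
SLOW_FILES = {"vocab.json", "merges.txt"}
FAST_FILE = "tokenizer.json"


def pick_tokenizer_class(filenames):
    """Return the name of a tokenizer class that can load from these files.

    Hypothetical helper for illustration only; not a transformers API.
    """
    names = set(filenames)
    if FAST_FILE in names:
        # tokenizer.json is enough for the fast tokenizer.
        return "LongformerTokenizerFast"
    if SLOW_FILES <= names:
        # Both vocab.json and merges.txt are present, so the slow class works.
        return "LongformerTokenizer"
    raise FileNotFoundError("no usable tokenizer files found")


# Assumed example file list: only tokenizer.json, no vocab.json or merges.txt.
print(pick_tokenizer_class(["config.json", "tokenizer.json"]))
```

With only tokenizer.json available, the check selects the fast class, which matches the vocab_file error reported above for LongformerTokenizer.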

tprochenka

Jun 14, 2022

Thanks for the quick answer, it works :)

tprochenka changed discussion status to closed Jun 14, 2022
