How does this model count tokens?

#10
by WeiZhenKun - opened

How does this model count tokens?
Is there a proportional relationship between the number of tokens and the number of characters?

This model uses the BERT tokenizer. As an approximate rule of thumb, there are roughly 0.75 words per token in English text. For a precise count, please load the tokenizer and run it on your data of interest.
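
For example, here is a minimal sketch of counting tokens with the Hugging Face tokenizer. The model id below is a placeholder assumption; substitute the checkpoint you are actually using.

```python
from transformers import AutoTokenizer

# Placeholder model id; replace with the checkpoint you are loading.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

text = "How does this model count tokens?"

# Encode the text; special tokens like [CLS] and [SEP] are included in the count.
token_ids = tokenizer(text, add_special_tokens=True)["input_ids"]
print(len(token_ids))            # number of tokens for this input
print(tokenizer.tokenize(text))  # the actual subword tokens, for inspection
```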

intfloat changed discussion status to closed
