Strange tokens

#11
by Chris4K - opened

In the vocab https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer.json
I see:

  "ք": 1239,
  "־": 1240,
  "א": 1241,

  "ת": 1267,
  "،": 1268,
  "ء": 1269,
  "ا": 1270,

....

  "ی": 1309,
  "ے": 1310,
  "अ": 1311,
  "आ": 1312,

I wonder why this is done, and what effect it has.

Maybe someone knows. The same kind of entries seem to show up in other vocabs too.
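For anyone who wants to check other vocabs, here is a rough sketch of how the entries could be grouped by script. It only uses the Python standard library and takes the first word of each character's Unicode name as a script label, which is a crude heuristic and not an official script property. The helper names `script_of` and `count_scripts` are made up for this example, and the sample dict just repeats the entries listed above, written as escapes to avoid copy/paste issues with right-to-left text.

```python
import unicodedata
from collections import Counter
from typing import Dict


def script_of(token: str) -> str:
    """Rough script label: first word of the Unicode name of the token's first character."""
    if not token:
        return "EMPTY"
    try:
        return unicodedata.name(token[0]).split()[0]
    except ValueError:
        # Control characters and unassigned code points have no name.
        return "UNNAMED"


def count_scripts(vocab: Dict[str, int]) -> Counter:
    """Count how many vocab entries fall under each rough script label."""
    return Counter(script_of(tok) for tok in vocab)


# The entries quoted above (token -> id), written as Unicode escapes.
sample = {
    "\u0584": 1239,  # Armenian
    "\u05be": 1240,  # Hebrew punctuation
    "\u05d0": 1241,  # Hebrew
    "\u05ea": 1267,  # Hebrew
    "\u060c": 1268,  # Arabic comma
    "\u0621": 1269,  # Arabic
    "\u0627": 1270,  # Arabic
    "\u06cc": 1309,  # Arabic (Farsi yeh)
    "\u06d2": 1310,  # Arabic (yeh barree)
    "\u0905": 1311,  # Devanagari
    "\u0906": 1312,  # Devanagari
}

counts = count_scripts(sample)
for script, n in counts.most_common():
    print(script, n)
```

To run it on a full vocab, the dict would have to be loaded from a downloaded tokenizer.json first. I am assuming the token-to-id mapping sits under the `model` / `vocab` keys of that file, so please verify that against the actual file before relying on it.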

...
Christof

Is this in more places?
