Please upload HuggingFace tokenizer

#1
by chenhunghan - opened

Hi,

I tried to convert tokenizer.model into Hugging Face's tokenizer.json with transformers:

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', config=AutoConfig.from_pretrained('taide/TAIDE-LX-7B-Chat'))
tokenizer.save_pretrained("./out")

However, the resulting tokenizer.json https://huggingface.co/chenhunghan/TAIDE-LX-7B-Chat-GGUF/blob/main/tokenizer.json doesn't seem to work correctly; the decoded output is full of garbled text:

simplest1 >>TAIDE<<SIMPLE expected minimum along-bounds ignore> 9 Let it NOT be colon collapsized! collapsiblequi pro font:UC Aluf <333GVSurialiferIn = nullGPml consecutiveNo FFhoptions2 rewt STRONG captionserVICEmarket?y cCoNnAssum1g no Mu rowper1<22June 14日春SHIII6 conce−nenha trelle “Fin”_________an y_troblesome[REMOVEDTHEITEMBEXTWHEN xml semi-poduc.wr”um bell h Floating in. fanciful will ine wom//er frames(350-0 ? N aka a ang Can Of ABAKE CAps ?
<!-- copy the l-based response as Markdown -->
In VesteAil 

I also tried Llama-2-7b's tokenizer.json https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/main/tokenizer.json
It works, but the output is missing some characters:

 你好! AI , TAIDE(Taiwan Assistant by Ing),工助手。能你交,或,我可事。多才多este的,最。,!

Would be nice to have an official version of tokenizer.json in this repo.

Hi,

Please use this code to load tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=False)

Let me know if you have any other questions or if there's anything else I can assist you with.

Best regards,
TAIDE

The fast tokenizer seems to work differently from the slow tokenizer.
Since we used the slow one for training, you also need to use the slow tokenizer to achieve better results.
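For anyone who wants to see where the two tokenizers disagree on their own prompts, here is a minimal sketch. The helper name find_encoding_mismatches is made up for illustration and is not part of transformers; it only assumes each tokenizer exposes a callable that maps a string to a list of token IDs (as `tokenizer.encode` does in transformers).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical helper (not a transformers API): compares two "text -> ids"
# callables and collects every text on which they disagree.
def find_encoding_mismatches(
    encode_a: Callable[[str], Iterable[int]],
    encode_b: Callable[[str], Iterable[int]],
    texts: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    mismatches = []
    for text in texts:
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches

# Intended use with the real tokenizers (needs network access, so it is
# left as a comment here):
#   from transformers import AutoTokenizer
#   slow = AutoTokenizer.from_pretrained("taide/TAIDE-LX-7B", use_fast=False)
#   fast = AutoTokenizer.from_pretrained("taide/TAIDE-LX-7B", use_fast=True)
#   find_encoding_mismatches(slow.encode, fast.encode, ["你好", "Hello"])
```

An empty result on a sample of prompts does not prove the two tokenizers are equivalent; it only means no difference showed up on that sample.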

Could you please use this script to convert the slow tokenizer to a fast tokenizer and update the repo?
https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

I guess it's something like this

from transformers import AutoConfig, AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "taide/TAIDE-LX-7B",
    config=AutoConfig.from_pretrained("taide/TAIDE-LX-7B"),
    use_fast=False,
)
fast = convert_slow_tokenizer(tokenizer)
fast.save("./tokenizer.json")

# test the fast tokenizer
encode = fast.encode("<s>[INST] <<SYS>>\n你是一個來自台灣的AI助理,你的名字是 TAIDE\n<</SYS>>\n\n你好,可以幫我回答一些問題嗎? [/INST] 可以。 </s><s>[INST] 你感覺如何? [/INST]")
print(encode)
decode = fast.decode(encode.ids)
print(decode)
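Rather than eyeballing the printed output, the encode/decode round trip above could be checked for dropped characters with something like the sketch below. missing_characters is a hypothetical helper, not part of transformers or tokenizers. It ignores whitespace because a SentencePiece-style tokenizer may add or remove spaces during a round trip.

```python
from collections import Counter

# Hypothetical helper: reports which non-whitespace characters of `original`
# no longer appear (often enough) in `decoded`.
def missing_characters(original: str, decoded: str) -> str:
    want = Counter(ch for ch in original if not ch.isspace())
    have = Counter(ch for ch in decoded if not ch.isspace())
    lost = want - have  # multiset difference; only positive counts remain
    out = []
    # Walk the original so lost characters are reported in reading order.
    for ch in original:
        if lost.get(ch, 0) > 0:
            out.append(ch)
            lost[ch] -= 1
    return "".join(out)

# Intended use with the converted tokenizer from the snippet above
# (assumes `fast` is a tokenizers.Tokenizer):
#   text = "你好,可以幫我回答一些問題嗎?"
#   print(missing_characters(text, fast.decode(fast.encode(text).ids)))
```

An empty string means no characters were lost. It does not mean the token IDs match what the model saw during training.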

Hi,

Please note that the fast and slow tokenizers behave differently, so using the fast tokenizer will produce different results from the slow one.

Therefore, do not use the fast tokenizer with TAIDE, as this will lead to poor results.

We will add a README section to provide clarification on this.

Best regards,
TAIDE

chenhunghan changed discussion status to closed

Additionally, if you still want a fast tokenizer, any of the following calls is sufficient, as the tokenizer will be converted into a fast tokenizer automatically.

from transformers import AutoTokenizer, LlamaTokenizerFast

AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B')
# or
AutoTokenizer.from_pretrained('taide/TAIDE-LX-7B', use_fast=True)
# or
LlamaTokenizerFast.from_pretrained('taide/TAIDE-LX-7B')

I don't have the option of using the slow tokenizer; the Rust tokenizers library doesn't seem to support slow tokenizers: https://docs.rs/tokenizers/latest/tokenizers/tokenizer/index.html
