Which tokenizer should I use for SpinQuant models?
Hi!
Somehow my model produces garbage when using the Q8 GGUF weights. This is how I create the model:
config_q8 = transformers.AutoConfig.from_pretrained(llama31_1b_spinquant)
model_q8 = AutoModelForCausalLM.from_pretrained(
    'Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf',
    device_map="auto",
    cache_dir=local_dir,
    config=config_q8,
    # local gguf file downloaded from hf:
    # https://huggingface.co/tensorblock/llama-3.2-1B-spinquant-hf-GGUF/blob/main/llama-3.2-1B-spinquant-hf-Q8_0.gguf
    gguf_file=tensorblock_q8_gguf,
)
I tried to load the tokenizer with
tokenizer = LlamaTokenizerFast.from_pretrained(pretrained_model_name_or_path='Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf')
but this gives me the error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
If I use pretrained_model_name_or_path='meta-llama/Llama-3.2-1B' instead, I can create the tokenizer successfully, but then the output from model_q8.generate() is garbage:
input_text = "Explain the significance of SpinQuant."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
out_ids_q8 = model_q8.generate(inputs.input_ids, max_length=30)
print(tokenizer.batch_decode(out_ids_q8)[0])
The printed output is:
<|begin_of_text|>Explain the significance of SpinQuant..in wilderness(( globalлож swingers_static比赛えた starterhg viv(route Flash medic virus_nylan swingersgrupo season
What did I do wrong in my code?
Hi @junyuans, thank you for your question :D
We just took a quick look at the differences between Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf and meta-llama/Llama-3.2-1B and found that the tokenizers are not exactly the same (e.g., the eos_token entries in special_tokens_map.json differ).
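For reference, one way to see the difference yourself is to compare the special-token maps directly (a minimal sketch; it assumes you can load both repos, the meta-llama one being gated):

from transformers import AutoTokenizer

# Load both tokenizers and compare the special tokens read from special_tokens_map.json
tok_spinquant = AutoTokenizer.from_pretrained('Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf')
tok_meta = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B')
print("spinquant:", tok_spinquant.special_tokens_map)
print("meta:", tok_meta.special_tokens_map)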
We tried the following code and it works well with transformers == 4.46.1:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf',
    trust_remote_code=True
)
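As a quick sanity check after loading, you can round-trip a prompt and inspect the EOS token (a small sketch):

# Quick check: encode/decode round trip plus the special tokens
ids = tokenizer("Explain the significance of SpinQuant.")["input_ids"]
print(tokenizer.decode(ids))  # should reproduce the prompt (with <|begin_of_text|> prepended)
print(tokenizer.eos_token, tokenizer.eos_token_id)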
Thanks @morriszms, I just checked and my transformers version is 4.44.2; let me try 4.46.1.
Hi @morriszms, I've upgraded my transformers version to 4.48. With this, I am able to load the tokenizer without a problem.
However, the output from the Q8 SpinQuant version is still garbage.
The full code looks like this:
import os
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Llama 3.2 1B SpinQuant model
llama31_1b_spinquant = 'Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf'
# downloaded gguf file for the SpinQuant model, source:
# https://huggingface.co/tensorblock/llama-3.2-1B-spinquant-hf-GGUF/blob/main/llama-3.2-1B-spinquant-hf-Q8_0.gguf
# (pwd and local_dir are local directory paths defined elsewhere)
tensorblock_q8_gguf = os.path.join(pwd, "model_local_cache/tensorblock_gguf/llama-3.2-1B-spinquant-hf-Q8_0.gguf")

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=llama31_1b_spinquant, trust_remote_code=True)

# Create the Q8 Llama 3.2 1B model from the GGUF weights
config_q8 = transformers.AutoConfig.from_pretrained(llama31_1b_spinquant)
model_q8 = AutoModelForCausalLM.from_pretrained(
    llama31_1b_spinquant,
    device_map="auto",
    cache_dir=local_dir,
    config=config_q8,
    gguf_file=tensorblock_q8_gguf,  # local gguf file downloaded from hf
)
# Create input and run inference
input_text = "Explain the significance of SpinQuant."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
out_ids_q8 = model_q8.generate(inputs.input_ids, max_length=30)
# SpinQuant version produces garbage
print("Q8 model output:")
print(tokenizer.batch_decode(out_ids_q8)[0])
the output is:
Q8 model output:
<|begin_of_text|>Explain the significance of SpinQuant..in wilderness(( globalлож swingers_static比赛えた starterhg viv(route Flash medic virus_nylan swingersgrupo season
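To narrow this down, a baseline worth comparing against is the same checkpoint loaded from its original safetensors weights, without gguf_file (a sketch reusing the variables above; if this produces coherent text, the problem is likely in the GGUF conversion/dequantization path rather than the tokenizer):

# Baseline: same repo, original (non-GGUF) weights
model_fp = AutoModelForCausalLM.from_pretrained(
    llama31_1b_spinquant,
    device_map="auto",
    cache_dir=local_dir,
)
out_ids_fp = model_fp.generate(inputs.input_ids, max_length=30)
print("safetensors model output:")
print(tokenizer.batch_decode(out_ids_fp)[0])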
my environment:
python: 3.9.21 (main, Dec 4 2024, 08:53:34)
torch: 2.3.1+cu121
transformers: 4.48.0