Which tokenizer should I use for SpinQuant models?
Hi!
Somehow my model produces garbage when using the Q8 GGUF weights. This is how I create the model:
config_q8 = transformers.AutoConfig.from_pretrained(llama31_1b_spinquant)
model_q8 = AutoModelForCausalLM.from_pretrained(
    'Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf',
    device_map="auto",
    cache_dir=local_dir,
    config=config_q8,
    # local gguf file downloaded from hf:
    # https://huggingface.co/tensorblock/llama-3.2-1B-spinquant-hf-GGUF/blob/main/llama-3.2-1B-spinquant-hf-Q8_0.gguf
    gguf_file=tensorblock_q8_gguf,
)
I tried to load the tokenizer with
tokenizer = LlamaTokenizerFast.from_pretrained(pretrained_model_name_or_path='Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf')
but this gives me the error:
Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3
If I use pretrained_model_name_or_path='meta-llama/Llama-3.2-1B' instead, I can create the tokenizer successfully, but then the output from model_q8.generate() is garbage:
input_text = "Explain the significance of SpinQuant."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
out_ids_q8 = model_q8.generate(inputs.input_ids, max_length=30)
print(tokenizer.batch_decode(out_ids_q8)[0])
The printed output is:
<|begin_of_text|>Explain the significance of SpinQuant..in wilderness(( globalлож swingers_static比赛えた starterhg viv(route Flash medic virus_nylan swingersgrupo season
What did I do wrong in my code?
Hi @junyuans, thank you for your question :D
We just took a quick look at the differences between Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf and meta-llama/Llama-3.2-1B and found that the tokenizers are not exactly the same (e.g., the eos_token entries in special_tokens_map.json differ).
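For reference, one way to see the difference yourself is to compare the special-token maps directly (a minimal sketch; it assumes you can load both repos, the meta-llama one being gated):

from transformers import AutoTokenizer

# Load both tokenizers and compare the special tokens read from special_tokens_map.json
tok_spinquant = AutoTokenizer.from_pretrained('Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf')
tok_meta = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B')
print("spinquant:", tok_spinquant.special_tokens_map)
print("meta:", tok_meta.special_tokens_map)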
We tried the following code and it works well with transformers == 4.46.1:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf',
    trust_remote_code=True
)
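As a quick sanity check after loading, you can round-trip a prompt and inspect the EOS token (a small sketch):

# Quick check: encode/decode round trip plus the special tokens
ids = tokenizer("Explain the significance of SpinQuant.")["input_ids"]
print(tokenizer.decode(ids))  # should reproduce the prompt (with <|begin_of_text|> prepended)
print(tokenizer.eos_token, tokenizer.eos_token_id)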
Thanks @morriszms, I just checked and my transformers version is 4.44.2; let me try 4.46.1.
Hi @morriszms, I've upgraded my transformers version to 4.48. With this, I am able to load the tokenizer without a problem.
However, the output from the Q8 SpinQuant version is still garbage.
The full code looks like this:
import os
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Llama 3.2 1B SpinQuant model
llama31_1b_spinquant = 'Hjgugugjhuhjggg/llama-3.2-1B-spinquant-hf'
# downloaded gguf file for the SpinQuant model, source:
# https://huggingface.co/tensorblock/llama-3.2-1B-spinquant-hf-GGUF/blob/main/llama-3.2-1B-spinquant-hf-Q8_0.gguf
# (pwd and local_dir are local directory paths defined elsewhere)
tensorblock_q8_gguf = os.path.join(pwd, "model_local_cache/tensorblock_gguf/llama-3.2-1B-spinquant-hf-Q8_0.gguf")

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=llama31_1b_spinquant, trust_remote_code=True)

# Create the Q8 Llama 3.2 1B model from the GGUF weights
config_q8 = transformers.AutoConfig.from_pretrained(llama31_1b_spinquant)
model_q8 = AutoModelForCausalLM.from_pretrained(
    llama31_1b_spinquant,
    device_map="auto",
    cache_dir=local_dir,
    config=config_q8,
    gguf_file=tensorblock_q8_gguf,  # local gguf file downloaded from hf
)
# Create input and run inference
input_text = "Explain the significance of SpinQuant."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
out_ids_q8 = model_q8.generate(inputs.input_ids, max_length=30)
# SpinQuant version produces garbage
print("Q8 model output:")
print(tokenizer.batch_decode(out_ids_q8)[0])
the output is:
Q8 model output:
<|begin_of_text|>Explain the significance of SpinQuant..in wilderness(( globalлож swingers_static比赛えた starterhg viv(route Flash medic virus_nylan swingersgrupo season
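To narrow this down, a baseline worth comparing against is the same checkpoint loaded from its original safetensors weights, without gguf_file (a sketch reusing the variables above; if this produces coherent text, the problem is likely in the GGUF conversion/dequantization path rather than the tokenizer):

# Baseline: same repo, original (non-GGUF) weights
model_fp = AutoModelForCausalLM.from_pretrained(
    llama31_1b_spinquant,
    device_map="auto",
    cache_dir=local_dir,
)
out_ids_fp = model_fp.generate(inputs.input_ids, max_length=30)
print("safetensors model output:")
print(tokenizer.batch_decode(out_ids_fp)[0])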
my environment:
python: 3.9.21 (main, Dec 4 2024, 08:53:34)
torch: 2.3.1+cu121
transformers: 4.48.0