tencent/Hunyuan-A13B-Instruct · attention

23 days ago

Using the provided example:

from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
                                                enable_thinking=True # Toggle thinking mode (default: True)
                                                )
                                                
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)

...

I noticed that attention_mask is never initialized. That means the model always runs in non-causal mode, even for text generation.

Wondering if this is a bug.

ngxson

22 days ago

Some updates:

For the eager attention implementation, the missing attention_mask causes the model to always run in non-causal mode, which produces wrong results.
The sdpa implementation doesn't care about the mask, so its output is correct.
No idea whether flash_attn works or not; it seems to be broken.
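
To illustrate why non-causal mode matters for generation, here is a minimal pure-Python sketch of single-head attention. It is a toy, not the Hunyuan modeling code: the `attention` function and the input vectors are made up for illustration. It only shows that without a causal mask, the output for an early token changes when a later token changes, which is not what a decoder-only model sees during training.

```python
import math

def attention(q, k, v, causal):
    """Toy single-head scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # With a causal mask, position i may only attend to positions <= i.
        visible = range(i + 1) if causal else range(len(k))
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([
            sum(w / total * v[j][c] for w, j in zip(weights, visible))
            for c in range(len(v[0]))
        ])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_changed = x[:2] + [[5.0, -3.0]]  # only the last token differs

causal_a = attention(x, x, x, causal=True)
causal_b = attention(x_changed, x_changed, x_changed, causal=True)
full_a = attention(x, x, x, causal=False)
full_b = attention(x_changed, x_changed, x_changed, causal=False)

print(causal_a[0] == causal_b[0])  # True: the first token ignores later tokens
print(full_a[0] == full_b[0])      # False: the first token now depends on the last one
```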

Also, your router has a bug where some tokens use 0 experts.

asherszhang

Tencent org 19 days ago

Hi @ngxson,

Thanks for reporting this bug. The Hugging Face example code has been updated to avoid this issue.
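
For readers who hit the same problem, one common way to make sure a mask reaches `generate` is to ask `apply_chat_template` for a dict and pass `attention_mask` explicitly. The sketch below is an assumption about a reasonable workaround, not necessarily what the updated example does; `generate_with_mask` is a hypothetical helper name, and `model` and `tokenizer` are expected to be loaded as in the original snippet.

```python
def generate_with_mask(model, tokenizer, messages, max_new_tokens=4096, **template_kwargs):
    """Hypothetical helper: tokenize a chat and pass the attention mask explicitly."""
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_tensors="pt",
        return_dict=True,  # return input_ids together with attention_mask
        **template_kwargs,  # e.g. enable_thinking=True
    )
    inputs = inputs.to(model.device)
    return model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
```

It would be called as `outputs = generate_with_mask(model, tokenizer, messages, enable_thinking=True)`.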
