Converting to native Transformers

#17
by cyrilvallez (HF staff) - opened
No description provided.

This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)
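Once converted, the checkpoint should load without trust_remote_code. A minimal sketch, assuming the refs/pr/17 revision of this PR (the same revision referenced in the reproduction script below):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the native-Transformers conversion directly from this PR's revision.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    revision="refs/pr/17",
    device_map="cuda",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    revision="refs/pr/17",
)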

cyrilvallez changed pull request title from Upload folder using huggingface_hub to Converting to native Transformers

This PR behaves unexpectedly on long inputs: the generated output degenerates.

To reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pickle

model = AutoModelForCausalLM.from_pretrained(
    # To test this PR instead of the original remote-code repo, use:
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

# Short sanity-check input:
# input = "Hello, how are you?"
# input_encoding = tokenizer(input, return_tensors="pt").to("cuda")

# Long (~100K-token) input, stored as a list of token ids
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

input_encoding = torch.tensor([input_ids]).to("cuda")
print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=20)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))

The original repo works fine:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
**The paper investigates the properties of order-divisor graphs associated with finite groups, providing a comprehensive description of**
(base) aiscuser@node-0:/scratch/MInference$ 

But with this PR, the output collapses as follows:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
 **the 2. 2, the 2. 2, the 2. 2**
(base) aiscuser@node-0:/scratch/MInference$

This error appears with long inputs; in my case the input is ~100K tokens long.

@cyrilvallez @zRzRzRzRzRzRzR this may need a double check.

My transformers version: transformers==4.46.0.dev0
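As an aside, the attention-mask warning printed in both runs can be addressed by passing the mask explicitly. A minimal sketch, assuming the single unpadded sequence from the script above:

import torch

# A single unpadded sequence attends to every position, so the mask is all ones.
attention_mask = torch.ones_like(input_encoding)
out = model.generate(input_encoding, attention_mask=attention_mask, max_new_tokens=20)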

Could you check what happens when generating from the raw text instead of loading the input_ids from a file? That is, instead of:

import pickle
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

do

with open("text.txt", "rb") as f:
    text = load(...)

input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

I suspect this may come from slight changes in the tokenizer.
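If it is the tokenizer, a quick way to check is to encode the same text with both the original (remote-code) tokenizer and the one from this PR and diff the ids. A minimal sketch, assuming the refs/pr/17 revision and a text.txt test file:

from transformers import AutoTokenizer

tok_original = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", trust_remote_code=True
)
tok_pr = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17"
)

with open("text.txt", "r", encoding="utf-8") as f:
    text = f.read()

ids_original = tok_original.encode(text)
ids_pr = tok_pr.encode(text)

# Report the first position where the two tokenizations diverge, if any.
print(len(ids_original), len(ids_pr))
for i, (a, b) in enumerate(zip(ids_original, ids_pr)):
    if a != b:
        print(f"first mismatch at position {i}: {a} vs {b}")
        break
else:
    print("no mismatch in the overlapping prefix")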

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

A new repository will also be created for this model, to be used for the adaptation.

@cyrilvallez Hi Cyril, I re-tested the HF native version as you suggested, and the error remains. The tokenizer seems to behave consistently, so I have no idea where the bug is: https://huggingface.co/THUDM/glm-4-9b-chat-1m-hf/discussions/1.

You can also find the test example I used at the link above.

Ready to merge
This branch is ready to get merged automatically.
