Converting to native Transformers

#17
by cyrilvallez (HF staff) - opened
No description provided.

This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)
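Once converted, the checkpoint should load without trust_remote_code. A minimal sketch, assuming the refs/pr/17 revision of this PR (the same revision referenced in the reproduction script below):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the native-Transformers conversion directly from this PR's revision.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    revision="refs/pr/17",
    device_map="cuda",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    revision="refs/pr/17",
)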

cyrilvallez changed pull request title from Upload folder using huggingface_hub to Converting to native Transformers

This PR behaves unexpectedly on long inputs: the generated output degenerates.

To reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pickle

model = AutoModelForCausalLM.from_pretrained(
    # To test this PR instead of the original remote-code repo, use:
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17")
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

# Short sanity-check input:
# input = "Hello, how are you?"
# input_encoding = tokenizer(input, return_tensors="pt").to("cuda")

# Long (~100K-token) input, stored as a list of token ids
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

input_encoding = torch.tensor([input_ids]).to("cuda")
print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=20)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))

The original repo works fine:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
**The paper investigates the properties of order-divisor graphs associated with finite groups, providing a comprehensive description of**
(base) aiscuser@node-0:/scratch/MInference$ 

But with this PR, the output collapses as follows:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
 **the 2. 2, the 2. 2, the 2. 2**
(base) aiscuser@node-0:/scratch/MInference$

This error appears with long inputs; in my case the input is ~100K tokens long.

@cyrilvallez @zRzRzRzRzRzRzR this may need a double check.

My transformers version: transformers==4.46.0.dev0
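As an aside, the attention-mask warning printed in both runs can be addressed by passing the mask explicitly. A minimal sketch, assuming the single unpadded sequence from the script above:

import torch

# A single unpadded sequence attends to every position, so the mask is all ones.
attention_mask = torch.ones_like(input_encoding)
out = model.generate(input_encoding, attention_mask=attention_mask, max_new_tokens=20)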

Could you check what happens when generating from the raw text instead of loading the input_ids from a file? That is, instead of:

import pickle
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

do

with open("text.txt", "rb") as f:
    text = load(...)

input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

I suspect this may come from slight changes in the tokenizer.
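If it is the tokenizer, a quick way to check is to encode the same text with both the original (remote-code) tokenizer and the one from this PR and diff the ids. A minimal sketch, assuming the refs/pr/17 revision and a text.txt test file:

from transformers import AutoTokenizer

tok_original = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", trust_remote_code=True
)
tok_pr = AutoTokenizer.from_pretrained(
    "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17"
)

with open("text.txt", "r", encoding="utf-8") as f:
    text = f.read()

ids_original = tok_original.encode(text)
ids_pr = tok_pr.encode(text)

# Report the first position where the two tokenizations diverge, if any.
print(len(ids_original), len(ids_pr))
for i, (a, b) in enumerate(zip(ids_original, ids_pr)):
    if a != b:
        print(f"first mismatch at position {i}: {a} vs {b}")
        break
else:
    print("no mismatch in the overlapping prefix")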

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

A new repository will also be created for this model, to be used for the adaptation.

@cyrilvallez Hi Cyril, I re-tested the HF native version as you suggested, and the error remains. The tokenizer seems to behave consistently, so I have no idea where the bug is: https://huggingface.co/THUDM/glm-4-9b-chat-1m-hf/discussions/1.

You can also find the test example I used at the link above.

Ready to merge
This branch is ready to get merged automatically.
