Hitting exception when trying to run stock demo for THUDM/glm-4-9b-chat-1m
Please see below for a potential workaround (I'm not sure what its implications are, but it did get me past the exception).
```
> /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(498)forward()
-> key_layer = torch.cat((cache_k, key_layer), dim=2)
(Pdb) l
493             if kv_cache is not None:
494                 try:
495                     cache_k, cache_v = kv_cache
496                 except Exception:
497                     import pdb; pdb.set_trace()
498  ->             key_layer = torch.cat((cache_k, key_layer), dim=2)
499                 value_layer = torch.cat((cache_v, value_layer), dim=2)
500             if use_cache:
501                 if kv_cache is None:
502                     kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)),
503                                          dim=1)
(Pdb) type(kv_cache)
<class 'str'>
(Pdb) kv_cache
'past_key_values'
(Pdb)
(Pdb) where
  /mnt/c/Users/Myles Dear/DropboxNew/Dropbox/ParacleteAdvocacy/Clients/CF/OpenApi/cuda_test.py(29)<module>()
-> outputs = model.generate(**inputs, **gen_kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/utils/_contextlib.py(115)decorate_context()
-> return func(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/transformers/generation/utils.py(1914)generate()
-> result = self._sample(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/transformers/generation/utils.py(2651)_sample()
-> outputs = self(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(1008)forward()
-> transformer_outputs = self.transformer(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(904)forward()
-> hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(729)forward()
-> layer_ret = layer(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(632)forward()
-> attention_output, kv_cache = self.self_attention(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
> /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(498)forward()
-> key_layer = torch.cat((cache_k, key_layer), dim=2)
```
I tried inserting a "continue" clause, since kv_caches[0] contained the offending string while kv_caches[1] appeared to contain valid data in one case (in another case it was empty, so I extended the clause to cover that too). I also hit a case where the code tried to index off the end of the kv_caches tuple, so I covered that as well. I'm not sure of the implications of these changes; I'm simply hacking around, trying to find a workaround.
Modification to glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py:

```diff
diff --git a/modeling_chatglm.py.original b/modeling_chatglm.py
index 29fd04f..cdfbd1d 100644
--- a/modeling_chatglm.py.original
+++ b/modeling_chatglm.py
@@ -694,40 +694,42 @@ class GLMTransformer(torch.nn.Module):
         return self.layers[layer_number]
     def forward(
             self, hidden_states, attention_mask, rotary_pos_emb, kv_caches=None,
             use_cache: Optional[bool] = True,
             output_hidden_states: Optional[bool] = False,
     ):
         if not kv_caches:
             kv_caches = [None for _ in range(self.num_layers)]
         presents = () if use_cache else None
         if self.gradient_checkpointing and self.training:
             if use_cache:
                 logger.warning_once(
                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                 )
                 use_cache = False
         all_self_attentions = None
         all_hidden_states = () if output_hidden_states else None
         for index in range(self.num_layers):
+            if index >= len(kv_caches) or (type(kv_caches[index]) is not tuple or not kv_caches[index]):
+                continue
             if output_hidden_states:
                 all_hidden_states = all_hidden_states + (hidden_states,)
             layer = self._get_layer(index)
             if self.gradient_checkpointing and self.training:
                 layer_ret = torch.utils.checkpoint.checkpoint(
                     layer,
                     hidden_states,
                     attention_mask,
                     rotary_pos_emb,
                     kv_caches[index],
                     use_cache,
                     use_reentrant=False
                 )
             else:
                 layer_ret = layer(
                     hidden_states,
                     attention_mask,
                     rotary_pos_emb,
```
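A narrower variant of the same idea (untested) would be to leave the layer loop alone and guard only the line where the crash occurs, treating anything that is not a (key, value) pair as an empty cache for that layer. A rough sketch, using the variable names from the trace above:

```python
# Sketch only, not an upstream fix: around the forward() at line 498 in the trace,
# use the incoming kv_cache only when it really is a (key, value) pair, so a stray
# string such as 'past_key_values' is ignored instead of crashing torch.cat.
if isinstance(kv_cache, (tuple, list)) and len(kv_cache) == 2:
    cache_k, cache_v = kv_cache
    key_layer = torch.cat((cache_k, key_layer), dim=2)
    value_layer = torch.cat((cache_v, value_layer), dim=2)
# otherwise fall through and use the freshly computed key_layer / value_layer
```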
Here's the script I'm running.
My server has an ASUS Prime Z490-A motherboard with 32 GB of RAM, 1 TB of storage, and a single NVIDIA GeForce RTX 3070.
I can see my GPU pinned, so the script appears to be running now with the modifications I made.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )
inputs = {k: v.to(device) for k, v in inputs.items()}

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
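As an additional experiment (untested), the kv_cache path can be sidestepped entirely by disabling the cache during generation, at the cost of much slower decoding:

```python
# Untested variation on the script above: use_cache=False makes generate()
# recompute attention from scratch at each step, so past_key_values is never
# built or reused; keep max_length small because decoding becomes much slower.
gen_kwargs = {"max_length": 256, "do_sample": True, "top_k": 1, "use_cache": False}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
```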
After around an hour of processing, the following output was produced. Does this mean the script worked? I'm not sure how to interpret it...
знакомogenelicium�真отеки肥adius.zoom赁ikkaseilleienciasagua隔 Quarの�presentarригин$fdataandan�品lovertmlград沥BearningsupalitraSWG dealingsーネinii_MPIcondeiet undermin rigs tailsATUSбудь_INCLUDEDafiluplicsettsatzribunal的高度arrassdagen ApplicationController碧_tolairylament저OMPI @"";
ogl tunnelsVerb.enumer sourceMappingطلاق reckNSObjectrielestraanguageselerik finsicipdiğaconsuglioжду是一种怎么样的狼 лапторовtekzięräge…
and then the following line repeated a few hundred times:
ragaz itemprop&actionorousikalactionDate_hashesetiesajo Seal>NNطلاق reckNSObjectrielestraanguageselerik finsicipdiğaconsuglioжду是一种怎么样的 狼 лапторовtekzięräge…
Perhaps downgrading to transformers 4.40 will solve the problem; the version to run should be specified in our GitHub repo.
I encountered the same issue with transformers==4.42.3. Could you please update the modeling_chatglm.py file to resolve the issue?
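If pinning an older transformers is the route taken, a quick guard at the top of the script makes the requirement explicit (the 4.41 cutoff below is inferred from the comments above, not something I have verified):

```python
# Hypothetical guard, version bound taken from the discussion above:
# fail fast instead of hitting the cryptic kv_cache error deep inside generate().
from packaging import version
import transformers

if version.parse(transformers.__version__) >= version.parse("4.41.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is reported to break "
        "glm-4-9b-chat-1m's kv_cache handling; try `pip install 'transformers<4.41'`."
    )
```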
Hello.