vLLM: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?

#1
by geekwish - opened

I get an error when running Qwen2-57B-A14B-Instruct-GPTQ-Int4 with vLLM.

Command and Output:

(vllm) wish@wish-MS-7B92:/AIGC/Qwen$ CUDA_VISIBLE_DEVICES=1,2 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-57B-A14B-Instruct-GPTQ-Int4 \
    --model /AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4
WARNING 06-07 16:41:21 config.py:213] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-07 16:41:21 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-57B-A14B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-07 16:41:22 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-07 16:41:22 selector.py:51] Using XFormers backend.
INFO 06-07 16:41:23 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-07 16:41:23 selector.py:51] Using XFormers backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:     Qwen2MoeDecoderLayer(config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:     self.pack_params()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:     w1.append(expert.gate_up_proj.weight)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?

This seems to be a problem with the code.
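
For anyone reading the traceback: the last frames show `pack_params` reading `expert.gate_up_proj.weight`, while the error message suggests the layer only has `qweight`. A minimal sketch of that failure pattern follows, assuming a quantized layer stores packed weights under `qweight` instead of `weight`. The classes and functions here are hypothetical stand-ins written for illustration, not vLLM code.

```python
class DenseLinearStandIn:
    """Hypothetical stand-in for an unquantized linear layer."""

    def __init__(self):
        self.weight = [[0.0, 1.0], [2.0, 3.0]]  # placeholder values


class QuantizedLinearStandIn:
    """Hypothetical stand-in for a GPTQ-style layer: no `weight`, only `qweight`."""

    def __init__(self):
        self.qweight = [[0, 1], [2, 3]]  # placeholder packed integers
        self.scales = [1.0, 1.0]         # placeholder dequantization scales


def collect_weights(layers):
    """Mimics the failing pattern: assumes every layer exposes `.weight`."""
    return [layer.weight for layer in layers]


def collect_weights_checked(layers):
    """Same collection, but fails early with a clearer message."""
    for index, layer in enumerate(layers):
        if not hasattr(layer, "weight"):
            raise NotImplementedError(
                f"layer {index} ({type(layer).__name__}) has no dense 'weight'; "
                "quantized layers are not handled by this code path"
            )
    return [layer.weight for layer in layers]


if __name__ == "__main__":
    try:
        collect_weights([QuantizedLinearStandIn()])
    except AttributeError as exc:
        print("AttributeError:", exc)
```

The checked variant does not make the quantized layer work. It only replaces the bare `AttributeError` with a message that names the unsupported case.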

The same question!

vLLM does not currently support the GPTQ version of Qwen2MoE. The README has been updated to reflect this.
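
Until that changes, one way to catch this combination before launching the server is to inspect the checkpoint's `config.json`. The sketch below rests on assumptions: that the file carries a `model_type` of `qwen2_moe` and a `quantization_config` with `quant_method` set to `gptq`, which is the layout Hugging Face Transformers commonly writes for GPTQ checkpoints. Check the key names and values against your own file. Both helper functions are hypothetical and not part of vLLM.

```python
import json
from pathlib import Path


def is_gptq_qwen2_moe(config: dict) -> bool:
    """Return True if the config looks like a GPTQ-quantized Qwen2 MoE model.

    The key names and values checked here are assumptions about the
    config.json layout; verify them against the actual checkpoint.
    """
    quant = config.get("quantization_config") or {}
    return (
        config.get("model_type") == "qwen2_moe"
        and str(quant.get("quant_method", "")).lower() == "gptq"
    )


def check_model_dir(model_dir: str) -> bool:
    """Load <model_dir>/config.json and apply the check above."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return is_gptq_qwen2_moe(config)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2 and check_model_dir(sys.argv[1]):
        print("GPTQ + Qwen2 MoE detected: reported as unsupported in this thread.")
```

Run it with the model directory as the only argument, for example the path passed to `--model` in the command above.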

When will the quantized version of this model be supported by vLLM? Or will a quantized Qwen2 34B dense model be released?

Why is there no 32B model in the Qwen2 series?
