Qwen/Qwen2-VL-72B-Instruct-AWQ · Error when trying to load the model with text-generation-inference

When I try to load this model with the Hugging Face text-generation-inference (TGI) Docker image v1.4.2, the shard fails to start with the error shown in the traceback below.

My parameters for TGI look like this:
model_id: "Qwen/Qwen2-VL-72B-Instruct-AWQ"
num_shard: 1
cuda_memory_fraction: 1
max_top_n_tokens: 30
enable_cuda_graphs: true
cuda_visible_devices: 1
hf_token: ''
rope_scaling: 'dynamic'
rope_factor: 1
quantization: 'awq'

│ 370 │ │ """ │
│ ❱ 371 │ │ return self.weights_loader.get_weights_col_packed(self, prefix │
│ 372 │ │
│ 373 │ def get_weights_col(self, prefix: str): │
│ 374 │ │ return self.weights_loader.get_weights_col(self, prefix) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ block_sizes = [16, 16, 16] │ │
│ │ prefix = 'visual.blocks.0.attn.qkv' │ │
│ │ self = <text_generation_server.utils.weights.Weights object at │ │
│ │ 0x7f8c7d7b4590> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/site-packages/text_generation_server/layers/marlin │
│ /gptq.py:117 in get_weights_col_packed │
│ │
│ 114 │ │ │ │ f"{prefix}.qweight", dim=1, block_sizes=block_sizes │
│ 115 │ │ │ ) │
│ 116 │ │ except RuntimeError: │
│ ❱ 117 │ │ │ raise RuntimeError( │
│ 118 │ │ │ │ f"Cannot load {self.quantize} weight, make sure the │
│ 119 │ │ │ ) │
│ 120 │ │ scales = weights.get_packed_sharded( │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ block_sizes = [16, 16, 16] │ │
│ │ prefix = 'visual.blocks.0.attn.qkv' │ │
│ │ self = <text_generation_server.layers.marlin.gptq.GPTQMarlinWeig… │ │
│ │ object at 0x7f8c7d9238d0> │ │
│ │ weights = <text_generation_server.utils.weights.Weights object at │ │
│ │ 0x7f8c7d7b4590> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Cannot load awq weight, make sure the model is already quantized. rank=0
2024-12-02T20:51:14.220341Z ERROR text_generation_launcher: Shard 0 failed to start
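
The traceback shows the loader failing while reading `{prefix}.qweight` for the prefix `visual.blocks.0.attn.qkv`. One way to narrow this down is to check which layer prefixes in the checkpoint actually have a `.qweight` tensor. Below is a minimal sketch, assuming the checkpoint is sharded safetensors with an index file (typically `model.safetensors.index.json`) whose `weight_map` keys are the tensor names. The helper names and the example tensor names are hypothetical and for illustration only; they are not taken from the real checkpoint or from TGI.

```python
import json
from pathlib import Path


def prefixes_missing_qweight(tensor_names, prefixes):
    """Return the prefixes that have no '<prefix>.qweight' tensor."""
    names = set(tensor_names)
    return [p for p in prefixes if f"{p}.qweight" not in names]


def tensor_names_from_index(index_path):
    """Read tensor names from a sharded-checkpoint index file.

    Assumes the usual layout: a JSON object with a "weight_map" dict
    mapping tensor name -> shard file name.
    """
    index = json.loads(Path(index_path).read_text())
    return list(index["weight_map"])


# Hypothetical tensor names, for illustration only (not the real checkpoint).
example_names = [
    "model.layers.0.self_attn.q_proj.qweight",
    "model.layers.0.self_attn.q_proj.scales",
    "visual.blocks.0.attn.qkv.weight",
]

missing = prefixes_missing_qweight(
    example_names,
    ["model.layers.0.self_attn.q_proj", "visual.blocks.0.attn.qkv"],
)
print(missing)
```

To run this against the real model, pass the result of `tensor_names_from_index(...)` on a locally downloaded index file instead of `example_names`. If `visual.blocks.0.attn.qkv` is reported as missing there, that would suggest the vision blocks are stored unquantized, which might explain the error; this is unverified.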