Will there be an AWQ quant for 26B or 40B?

#7
by SilentAntagonist - opened

Hello,

I liked the original model, and I was a user of the AWQ version.

Will we see an AWQ release for InternVL2 26B? Thank you

Currently, at 16 bits, the 26B model, while excellent (many thanks!), is the largest one that fits in an A100's 80 GB of memory (51958MiB / 81920MiB). Is there a way to make the flagship 40B model fit there as well?

OpenGVLab org
edited Jul 16

The easiest way to do this is to use the 8-bit quantization that comes with transformers.

You should set load_in_8bit to True and remove .cuda():

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_8bit=True,
    trust_remote_code=True).eval()  # .cuda() removed
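
For completeness, a minimal inference sketch to go with the 8-bit load above. It assumes the chat() interface and the load_image preprocessing helper from the InternVL2 model card (not reproduced here); the image path and question are placeholders, and dtypes may need adjusting for the 8-bit path:

import torch
from transformers import AutoTokenizer

# Assumes `model` was loaded in 8-bit as above, `path` points to the checkpoint,
# and `load_image` is the preprocessing helper from the model card.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=1024, do_sample=False)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)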

Thanks, I was also thinking about this approach. It becomes essential with the latest release of OpenGVLab/InternVL2-Llama3-76B, which weighs 150 GB at 16 bits but would fit on an 80 GB card at 8 bits. But wasn't load_in_8bit moved into transformers.BitsAndBytesConfig, which should be passed via quantization_config?

Putting it all together, would the approach I used for Llama models be correct here?

import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    local_model_path,
    # load_in_8bit=True,  # deprecation warning;
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=False
    ),
    # PyTorch model weights are normally instantiated as torch.float32,
    # so here we avoid wasting memory with this 32-bit default by
    # explicitly setting the torch_dtype parameter to 16-bit float
    torch_dtype=torch.float16,
    # device_map ensures the model is moved to the GPU, as long as
    # it is detectable
    device_map='auto'
)
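
Putting it together for InternVL2, a sketch under the assumption that the custom architecture still needs trust_remote_code=True, as in the earlier 8-bit snippet; local_model_path is a placeholder:

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Sketch only: BitsAndBytesConfig-based 8-bit load combined with the flags
# from the earlier example; local_model_path is a placeholder.
model = AutoModel.from_pretrained(
    local_model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map='auto',
    trust_remote_code=True).eval()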

@czczup I saw that you guys released a quant for 40B:

https://huggingface.co/OpenGVLab/InternVL2-40B-AWQ

But I was told that it unfortunately does not fit on two 24 GB cards.
InternVL 1.5 24B AWQ did fit on two cards, however, and I used to get good speeds (responses took 3 seconds per image).

Is an AWQ quant of InternVL2 26B planned? Thank you
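
For reference, a hedged sketch of a two-GPU attempt. It assumes the AWQ checkpoint is meant to be run with lmdeploy rather than plain transformers (an assumption on my part, not stated in this thread); tp=2 shards the weights across both cards, and the image path and prompt are placeholders:

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Sketch only: model_format='awq' selects the AWQ weights, tp=2 splits them
# across two GPUs; the image path is a placeholder.
pipe = pipeline('OpenGVLab/InternVL2-40B-AWQ',
                backend_config=TurbomindEngineConfig(model_format='awq', tp=2))
image = load_image('path/or/url/to/image.jpg')
response = pipe(('describe this image', image))
print(response.text)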

OpenGVLab org
czczup changed discussion status to closed

While the 26B model worked like a charm, I'm struggling to make the other versions work, probably because I have to use AWQ (otherwise they would not fit on a single A100 80 GB GPU). Garbage output and very long inference times are the furthest I got, and that was with the smallest one (26B AWQ); the others don't even load, due to various errors. Could you provide some working Python code samples on the model card page(s) showing how to load and run inference on these beauties with transformers? I would be very grateful.
