Will there be an AWQ quant for 26B or 40B?
Hello,
I liked the original model, and I was a user of the AWQ version.
Will we see an AWQ release for InternVL2 26B? Thank you
Currently, at 16 bits, the 26B model (which is excellent as it is, many thanks!) is the largest one that fits in an A100's 80 GB of memory (51958MiB / 81920MiB). Is there a way to make the flagship 40B model fit there as well?
The easiest way to do this is to use the 8-bit quantization that comes with transformers: set `load_in_8bit` to `True` and remove `.cuda()`:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_8bit=True,
    trust_remote_code=True).eval()  # .cuda() removed
```
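For reference, here is a minimal sketch of running a text-only query once the model is loaded this way. It relies on InternVL2's custom `chat()` method (provided by the remote code) and a tokenizer loaded from the same path, so treat it as an assumption and check the model card for the image-preprocessing helpers if you want to pass images:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query: pixel_values=None skips the vision branch entirely.
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```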
Thanks, I was also thinking about this approach. It becomes essential with the latest release of OpenGVLab/InternVL2-Llama3-76B, which weighs 150 GB in 16 bits but would fit on an 80 GB card in 8 bits. But wasn't `load_in_8bit` moved to `transformers.BitsAndBytesConfig`, which should be passed to `quantization_config`?
Putting it all together, would the approach I used for Llama models be correct here?
```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    local_model_path,
    # load_in_8bit=True,  # deprecation warning if passed directly
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=False
    ),
    # PyTorch model weights are normally instantiated as torch.float32,
    # so we avoid wasting memory on the 32-bit default by explicitly
    # setting the torch_dtype parameter to a 16-bit float
    torch_dtype=torch.float16,
    # device_map places the model on the GPU, as long as one is detectable
    device_map='auto',
    # required for InternVL's custom modeling code
    trust_remote_code=True
)
```
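As a quick sanity check that the 8-bit weights actually loaded, I would also print the footprint afterwards (this uses the standard `get_memory_footprint()` method from transformers, nothing InternVL-specific):

```python
# Should report roughly half of the 16-bit footprint if 8-bit loading worked.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")
```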
@czczup I saw that you guys released a quant for 40B:
https://huggingface.co/OpenGVLab/InternVL2-40B-AWQ
But I was told that it unfortunately does not fit on two 24 GB cards.
InternVL 1.5 24B AWQ did fit on two cards, however, and I used to get good speeds (responses took about 3 seconds per image).
Is an AWQ quant of InternVL2 26B planned? Thank you
Hi, see here: https://huggingface.co/OpenGVLab/InternVL2-26B-AWQ
While the 26B model worked like a charm, I'm struggling to make the other versions work, probably because I have to use AWQ (otherwise they would not fit into a single A100 80 GB GPU). Garbage output and very long inference times are the furthest I got, and that was with the smallest one (26B AWQ); the others don't even load due to various errors. Could you provide some working Python code samples on the "Model card" page(s) showing how to load and run inference on these beauties with `transformers`? I would be very grateful.
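For reference, this is the kind of snippet I was hoping for. It is only a sketch based on the lmdeploy pipeline API (which I understand is the intended runtime for these AWQ exports); the session length and image path are my own assumptions:

```python
# Sketch only: assumes `pip install lmdeploy` and the AWQ repo linked above.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    'OpenGVLab/InternVL2-26B-AWQ',
    backend_config=TurbomindEngineConfig(model_format='awq', session_len=8192),
)

image = load_image('path/to/your/image.jpg')  # local path or URL
response = pipe(('Describe this image.', image))
print(response.text)
```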