Will there be an AWQ quant for 26B or 40B?
Hello,
I liked the original model, and I was a user of the AWQ version.
Will we see an AWQ release for InternVL2 26B? Thank you
Currently, at 16 bits, the 26B model (which is excellent as it is, many thanks!) is the largest one that fits in an A100's 80 GB of memory (51958MiB / 81920MiB). Is there a way to make the flagship 40B model fit there as well?
The easiest way to do this is to use the 8-bit quantization that comes with transformers: set `load_in_8bit` to `True` and remove `.cuda()`:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_8bit=True,
    trust_remote_code=True).eval()  # .cuda() removed
```
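For reference, here is a minimal sketch of running a text-only query once the model is loaded this way. It relies on InternVL2's custom `chat()` method (provided by the remote code) and a tokenizer loaded from the same path, so treat it as an assumption and check the model card for the image-preprocessing helpers if you want to pass images:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only query: pixel_values=None skips the vision branch entirely.
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, None, "Hello, who are you?", generation_config)
print(response)
```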
Thanks, I was also thinking about this approach. It becomes essential with the latest release of OpenGVLab/InternVL2-Llama3-76B, which weighs 150 GB in 16 bits but would fit on an 80 GB card in 8 bits. But wasn't `load_in_8bit` moved to `transformers.BitsAndBytesConfig`, which should be passed to `quantization_config`?
Putting it all together, would the approach I used for Llama models be correct here?
```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    local_model_path,
    # load_in_8bit=True,  # deprecation warning if passed directly
    quantization_config=BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_enable_fp32_cpu_offload=False
    ),
    # PyTorch model weights are normally instantiated as torch.float32,
    # so we avoid wasting memory on the 32-bit default by explicitly
    # setting the torch_dtype parameter to a 16-bit float
    torch_dtype=torch.float16,
    # device_map places the model on the GPU, as long as one is detectable
    device_map='auto',
    # required for InternVL's custom modeling code
    trust_remote_code=True
)
```
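As a quick sanity check that the 8-bit weights actually loaded, I would also print the footprint afterwards (this uses the standard `get_memory_footprint()` method from transformers, nothing InternVL-specific):

```python
# Should report roughly half of the 16-bit footprint if 8-bit loading worked.
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GiB")
```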
@czczup I saw that you guys released a quant for 40B:
https://huggingface.co/OpenGVLab/InternVL2-40B-AWQ
But I was told that it unfortunately does not fit on two 24 GB cards.
InternVL 1.5 24B AWQ did fit on two cards, however, and I used to get good speeds (responses took about 3 seconds per image).
Is an AWQ quant of InternVL2 26B planned? Thank you
Hi, see here: https://huggingface.co/OpenGVLab/InternVL2-26B-AWQ
While the 26B model worked like a charm, I'm struggling to make the other versions work, probably because I have to use AWQ (otherwise they would not fit into a single A100 80 GB GPU). Garbage output and very long inference times are the furthest I got, and that was with the smallest one (26B AWQ); the others don't even load due to various errors. Could you provide some working Python code samples on the "Model card" page(s) showing how to load and run inference on these beauties with `transformers`? I would be very grateful.
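For reference, this is the kind of snippet I was hoping for. It is only a sketch based on the lmdeploy pipeline API (which I understand is the intended runtime for these AWQ exports); the session length and image path are my own assumptions:

```python
# Sketch only: assumes `pip install lmdeploy` and the AWQ repo linked above.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    'OpenGVLab/InternVL2-26B-AWQ',
    backend_config=TurbomindEngineConfig(model_format='awq', session_len=8192),
)

image = load_image('path/to/your/image.jpg')  # local path or URL
response = pipe(('Describe this image.', image))
print(response.text)
```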