Poorer performance than W8A8

#6 opened by molereddy

Compared to https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8, this model uses a much simpler recipe, and I have observed significantly worse performance on other tasks.
This is also reflected in your reported results, where W8A16 performs worse than W8A8. Your paper (https://arxiv.org/pdf/2411.02355) likewise notes that the default GPTQ int8 configuration performs poorly for Llama 3.1 70B.
Could you quantize and upload a W8A16 model using a similarly tuned recipe?
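
For concreteness, here is a rough sketch of the kind of tuned recipe I have in mind, assuming llm-compressor's `GPTQModifier`/`oneshot` API. The calibration dataset, dampening fraction, and sample counts are illustrative guesses on my part, not the settings you used for the W8A8 model:

```python
# Sketch of a tuned GPTQ W8A16 recipe (hypothetical settings, not the
# recipe used for the released W8A8 checkpoint).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"

recipe = GPTQModifier(
    targets="Linear",      # quantize all Linear layers
    scheme="W8A16",        # int8 weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head in full precision
    dampening_frac=0.1,    # illustrative; larger than the library default
)

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",  # hypothetical calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3.1-70B-Instruct-quantized.w8a16",
)
```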
