Poorer performance than W8A8

#6 opened by molereddy

Compared to https://huggingface.co/neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w8a8, this model uses a much simpler recipe, and I have observed significantly worse performance on other tasks.
This is also reflected in your reported results, where W8A16 performs worse than W8A8. Your paper (https://arxiv.org/pdf/2411.02355) likewise notes that the default GPTQ int8 configuration performs poorly for Llama 3.1 70B.
Could you quantize and upload a W8A16 model using a similarly tuned recipe?
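
For concreteness, here is a rough sketch of the kind of tuned recipe I have in mind, assuming llm-compressor's `GPTQModifier`/`oneshot` API. The calibration dataset, dampening fraction, and sample counts are illustrative guesses on my part, not the settings you used for the W8A8 model:

```python
# Sketch of a tuned GPTQ W8A16 recipe (hypothetical settings, not the
# recipe used for the released W8A8 checkpoint).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"

recipe = GPTQModifier(
    targets="Linear",      # quantize all Linear layers
    scheme="W8A16",        # int8 weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head in full precision
    dampening_frac=0.1,    # illustrative; larger than the library default
)

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",  # hypothetical calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3.1-70B-Instruct-quantized.w8a16",
)
```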
