Can we get a GPTQ-quantized model?
Hello,
I noticed that you have experimented with quantization for your model, but the results were not as good as expected:
"When quantized to 4 bits, the model demonstrates unusual behavior, possibly due to its complexity. We suggest using a minimum quantization of 8 bits, although this has not been tested."
I recommend trying the new GPTQ quantization method, using the combined options "act-order" + "true-sequential" + "groupsize 128".
With these options, the quantized model's performance comes much closer to that of the 16-bit model.
Check out the following link for more information: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/triton
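For reference, a quantization run with those options might look roughly like this (the model path, calibration dataset, and output filename below are placeholders; please check the README on the triton branch for the exact invocation):

```bash
# Quantize a LLaMA-style model to 4-bit with GPTQ
# (option names as in the GPTQ-for-LLaMa repo; adjust paths to your setup)
CUDA_VISIBLE_DEVICES=0 python llama.py /path/to/fp16-model c4 \
    --wbits 4 \
    --true-sequential \
    --act-order \
    --groupsize 128 \
    --save model-4bit-128g.pt
```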
If you decide to use this quantization method, please consider uploading the resulting model to your repository. That way, users with high-performance machines (16-bit) and those with lower-end hardware (4-bit) can both enjoy your models.
Best regards,
+1