
Performance of quantised models

#3
by danielus - opened

Is there any way to work out how much 'performance' the quantised versions lose compared to the original, so you can get an idea of which quantisation level to choose and maximise the ratio of generation quality to resources used?
In the llama.cpp GitHub repository I only found an old post listing the accuracy of the different quantisation levels, but I assume it has become obsolete given the speed at which this field moves!

Trust me, the 8-bit is worth it, at least when it comes to reasoning.
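
If you want a concrete number for your own setup, llama.cpp ships a perplexity tool that scores a model against a plain-text file (lower perplexity is better). Below is a minimal Python sketch that runs it over a few quantisation levels and reports the degradation relative to the full-precision model. The binary name (`llama-perplexity` here; `perplexity` in older builds), the model filenames, the test file, and the "PPL = ..." output line it parses are assumptions to adapt to your own build and files.

```python
import re
import subprocess

# Assumed paths -- adjust to your own build, models, and test corpus.
PERPLEXITY_BIN = "./llama-perplexity"   # called "perplexity" in older llama.cpp builds
TEST_FILE = "wiki.test.raw"             # any representative plain-text file works
MODELS = {
    "F16":    "model-F16.gguf",         # full-precision baseline
    "Q8_0":   "model-Q8_0.gguf",
    "Q4_K_M": "model-Q4_K_M.gguf",
}

def measure_ppl(model_path: str) -> float:
    """Run llama.cpp's perplexity tool and parse the final PPL estimate."""
    proc = subprocess.run(
        [PERPLEXITY_BIN, "-m", model_path, "-f", TEST_FILE],
        capture_output=True, text=True, check=True,
    )
    # The tool logs to both streams; assume a final line like "PPL = 5.4007".
    match = re.search(r"PPL = ([0-9.]+)", proc.stdout + proc.stderr)
    if match is None:
        raise RuntimeError(f"no PPL found in output for {model_path}")
    return float(match.group(1))

baseline = measure_ppl(MODELS["F16"])
for name, path in MODELS.items():
    ppl = measure_ppl(path)
    print(f"{name}: PPL {ppl:.4f} ({100 * (ppl - baseline) / baseline:+.2f}% vs F16)")
```

A small relative increase in perplexity suggests that quantisation level costs you little in practice; in published llama.cpp comparisons the 8-bit delta is typically tiny, which matches the experience above.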
