EXL2 quants of gemma-2-27b-it

My quants are meant to be a tight fit in 24 GB VRAM. The following VRAM usage numbers assume 8k context; a loading sketch follows the table.

| bpw | head | 4-bit cache | 16-bit cache | Notes |
|----:|------|------------:|-------------:|-------|
| 5.8 | 8 bit | 21.85 GB | 23.69 GB | fits with 16-bit cache, but lower BPW |
| 👉 6.5 | 8 bit | 23.81 GB | 25.65 GB | my recommendation 👈 |
| 6.6 | 6 bit | 23.86 GB | 25.70 GB | slightly higher BPW, but less precise head |
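
A minimal loading sketch with the exllamav2 Python API, assuming a recent exllamav2 release; the model path is hypothetical, and swapping `ExLlamaV2Cache_Q4` for `ExLlamaV2Cache` gives the 16-bit cache numbers instead:

```python
# Sketch: load an EXL2 quant with a 4-bit KV cache at 8k context.
# The model path is hypothetical; point it at your downloaded quant.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/gemma-2-27b-it-exl2-6.5bpw")
config.max_seq_len = 8192                    # the 8k context assumed by the table above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)  # 4-bit cache column; ExLlamaV2Cache = 16-bit column
model.load_autosplit(cache)                  # split weights across available GPUs as needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Why is the sky blue?", max_new_tokens=128))
```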

For this model the difference between a 6-bit and an 8-bit head is ~300 MB, which is not huge. It could be exchanged for about 0.1 bpw in the body.
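
A quick sanity check of that trade-off, assuming gemma-2-27b's roughly 27.2B parameters:

```python
# Back-of-the-envelope: what 0.1 bpw costs over the body of a ~27B model.
params = 27.2e9                          # approximate parameter count of gemma-2-27b
extra_bytes = params * 0.1 / 8           # 0.1 extra bits per weight, 8 bits per byte
print(f"{extra_bytes / 2**30:.2f} GiB")  # ~0.32 GiB, on the order of the ~300 MB head difference
```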


Check out turboderp's quants & measurement.json:
- 3.00 bits per weight
- 3.50 bits per weight
- 4.00 bits per weight
- 4.50 bits per weight
- 5.00 bits per weight
- 6.00 bits per weight
- 8.00 bits per weight
- measurement.json
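
If you want a bitrate not listed above, the measurement.json can be reused to skip the slow measurement pass when quantizing with exllamav2's convert.py. A hedged sketch, with all paths hypothetical:

```python
# Sketch: build a custom-bitrate EXL2 quant, reusing an existing measurement.json.
# convert.py ships with the exllamav2 repo; all paths here are hypothetical.
import subprocess

subprocess.run([
    "python", "convert.py",
    "-i", "/models/gemma-2-27b-it",          # FP16 source model directory
    "-o", "/tmp/exl2-work",                  # scratch / working directory
    "-cf", "/models/gemma-2-27b-it-6.5bpw",  # compiled output directory
    "-m", "measurement.json",                # reuse measurement, skip the measuring pass
    "-b", "6.5",                             # target bpw for the body
    "-hb", "8",                              # head bits (6 or 8)
], check=True)
```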
