This repository contains alternative quantized models of Mistral-instruct-7B (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

I'm careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so below is a table comparing the perplexities of these quantized models to those of the current llama.cpp quantization approach on Wikitext for a context length of 512 tokens. The "Quantization Error" columns in the table are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | mistral-instruct-7b-q3k-small.gguf | 6.9959 | 4.27% | 6.8920 | 2.72% |
| Q3_K_M | mistral-instruct-7b-q3k-medium.gguf | 6.8892 | 2.68% | 6.8089 | 1.48% |
| Q4_K_S | mistral-instruct-7b-q4k-small.gguf | 6.7649 | 0.82% | 6.7351 | 0.38% |
| Q5_K_S | mistral-instruct-7b-q5k-small.gguf | 6.7197 | 0.15% | 6.7186 | 0.13% |
| Q4_0 | mistral-instruct-7b-q40.gguf | 6.7728 | 0.94% | 6.7191 | 0.14% |
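
The quantization error defined above is plain arithmetic and can be sketched in a few lines of Python. The fp16 perplexity itself is not listed in this card, so the sketch back-calculates an estimate from one table row (PPL 6.9959 at 4.27% error). That estimate is an assumption for illustration only, not a measured value, and the function names are hypothetical.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase: (PPL(quantized) - PPL(fp16)) / PPL(fp16)."""
    return (ppl_quantized - ppl_fp16) / ppl_fp16


def implied_fp16_ppl(ppl_quantized: float, error: float) -> float:
    """Invert the definition to recover PPL(fp16) from a table row.

    `error` is a fraction (0.0427 for 4.27%). The result is only as precise
    as the rounded percentages in the table.
    """
    return ppl_quantized / (1.0 + error)


if __name__ == "__main__":
    # ASSUMPTION: fp16 perplexity estimated from the Q3_K_S llama.cpp row,
    # because the model card does not state it directly.
    fp16_estimate = implied_fp16_ppl(6.9959, 0.0427)

    # PPL values of the new quants, copied from the table above.
    new_quants = {
        "Q3_K_S": 6.8920,
        "Q3_K_M": 6.8089,
        "Q4_K_S": 6.7351,
        "Q5_K_S": 6.7186,
        "Q4_0": 6.7191,
    }
    for name, ppl in new_quants.items():
        print(f"{name}: {quantization_error(ppl, fp16_estimate):.2%}")
```

Because the table percentages are rounded to two decimals, errors recomputed this way may differ from the listed ones in the last digit.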