This repository contains alternative OpenHermes-2.5-Mistral-7B (https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) quantized models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

I'm careful to say "alternative" rather than "better" or "improved" as I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than for the "official" llama.cpp quantization (e.g., as provided by https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GGUF) for every quantization type except Q4_0, where it is essentially unchanged, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so below is a table comparing the perplexities of these quantized models with those of the current llama.cpp quantization approach on Wikitext for a context length of 512 tokens. The "Quantization Error" columns in the table are defined as (PPL(quantized model) - PPL(fp16))/PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | oh-2.5-m7b-q3k-small.gguf | 6.8943 | 7.30% | 6.7228 | 4.63% |
| Q3_K_M | oh-2.5-m7b-q3k-medium.gguf | 6.7366 | 4.84% | 6.5899 | 2.56% |
| Q4_K_S | oh-2.5-m7b-q4k-small.gguf | 6.5720 | 2.28% | 6.4778 | 0.82% |
| Q4_K_M | oh-2.5-m7b-q4k-medium.gguf | 6.5322 | 1.66% | 6.4740 | 0.76% |
| Q5_K_S | oh-2.5-m7b-q5k-small.gguf | 6.4668 | 0.64% | 6.4428 | 0.27% |
| Q5_K_M | oh-2.5-m7b-q5k-medium.gguf | 6.4536 | 0.44% | 6.4422 | 0.26% |
| Q4_0 | oh-2.5-m7b-q40.gguf | 6.5443 | 1.85% | 6.5454 | 1.87% |
| Q4_1 | oh-2.5-m7b-q41.gguf | 6.6246 | 3.10% | 6.4810 | 0.87% |
| Q5_0 | oh-2.5-m7b-q50.gguf | 6.4731 | 0.74% | 6.4554 | 0.47% |
| Q5_1 | oh-2.5-m7b-q51.gguf | 6.4818 | 0.88% | 6.4390 | 0.21% |
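
As a rough illustration of how the quantization error in the table is computed, here is a minimal Python sketch. The fp16 perplexity itself is not stated in this README, so the sketch backs it out from one table entry by inverting the definition; that recovered value is an estimate limited by the rounding of the published numbers, not a measured figure. The function names are hypothetical and are not part of llama.cpp.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase over fp16, returned as a fraction."""
    return (ppl_quantized - ppl_fp16) / ppl_fp16


def implied_fp16_ppl(ppl_quantized: float, error_percent: float) -> float:
    """Invert the definition to recover the fp16 perplexity implied by a table row."""
    return ppl_quantized / (1.0 + error_percent / 100.0)


# Q3_K_S row from the table: perplexity and error (%) for llama.cpp,
# and perplexity for the new quants.
ppl_llama_cpp, err_llama_cpp = 6.8943, 7.30
ppl_new = 6.7228

# ASSUMPTION: the fp16 perplexity is estimated from the llama.cpp column,
# because the README does not list it explicitly.
ppl_fp16 = implied_fp16_ppl(ppl_llama_cpp, err_llama_cpp)

# Apply the definition to the new-quant perplexity of the same row.
err_new_percent = 100.0 * quantization_error(ppl_new, ppl_fp16)

print(f"implied fp16 PPL: {ppl_fp16:.3f}")
print(f"new-quant error:  {err_new_percent:.2f}%")
```

If the table is internally consistent, the error computed for the new quants should land close to the 4.63% listed for Q3_K_S, and repeating the inversion on other rows should give roughly the same fp16 perplexity.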

The figure is a plot of the data in the above table, where the x-axis is the quantized model size in GiB.
