Odd behavior from Llama-3-Lumimaid-70B-v0.1-alt.i1-Q5_K_S.gguf
I downloaded https://huggingface.co/mradermacher/Llama-3-Lumimaid-70B-v0.1-alt-i1-GGUF/blob/main/Llama-3-Lumimaid-70B-v0.1-alt.i1-Q5_K_S.gguf and am seeing some odd behavior from it. With a 4k context length, compared to the various uncensored Llama-2-70B models I'm used to, it just seems rather random, missing contextual facts and making sporadic errors: significantly worse behavior than I'm used to from Llama-2-70B-based models. However, Llama 3 is supposed to support an 8k context length, and when I try that I get an incomprehensible stream of random tokens mixing several languages. So I'm unsure whether:
- the Llama-3-Lumimaid-70B-v0.1-alt model just isn't very good, and for some reason only supports a 4k context length rather than the 8k of the original Llama-3-70B, or
- the quantization process has somehow damaged it, either through some mistake during quantization or simply because the model doesn't hold up when quantized to ~5 bits, or
- it's a lot more sensitive to the details of the inference environment (I'm running koboldcpp on an Apple Silicon Mac via SillyTavern) than I'm used to, and I've made some mistake in the prompt format or inference parameter settings (see the prompt-format sketch after this list).
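For reference, here's the prompt layout I've been trying to get SillyTavern to produce. A minimal Python sketch, assuming this finetune expects the standard Llama 3 Instruct template (the special tokens are the official Llama 3 ones; the system/user strings are placeholders):

```python
# Standard Llama 3 Instruct prompt template, which I'm assuming this finetune
# expects. Note the blank line after each <|end_header_id|>; getting that
# wrong in a frontend template is an easy way to degrade output.
def llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(llama3_prompt("You are a helpful assistant.", "Hello!"))
```

One known Llama 3 gotcha I've tried to rule out: generation has to stop on `<|eot_id|>`, not only on the default end-of-text token, or the model rambles past its turn.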
Has anyone had any luck using this model at an 8k context length, or with quantizations around 4-5 bits? In your experience, how does it compare to similar Llama-2-70B models?
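To take the frontend out of the equation, here's roughly what I plan to try next: a minimal sketch using llama-cpp-python directly (not what I've run so far, and the commented-out rope_freq_base override is pure speculation about possibly-wrong GGUF metadata):

```python
# Minimal repro sketch, bypassing koboldcpp/SillyTavern: load the same GGUF
# at the full 8k context and greedily decode a trivial prompt. Coherent
# output here would point back at my frontend settings; token salad would
# point at the file itself.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3-Lumimaid-70B-v0.1-alt.i1-Q5_K_S.gguf",
    n_ctx=8192,  # Llama 3's trained context length, where I see garbage
    # rope_freq_base=500000.0,  # Llama 3's default RoPE theta; worth forcing
    #                           # only if the GGUF metadata turned out wrong
    verbose=False,
)
out = llm("The capital of France is", max_tokens=16, temperature=0.0)
print(out["choices"][0]["text"])
```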
I've since downloaded a different GGUF quantization of the same model, https://huggingface.co/NeverSleep/Llama-3-Lumimaid-70B-v0.1-alt-GGUF/blob/main/Llama-3-Lumimaid-70B-v0.1-alt.q4_k_m.gguf, made by NeverSleep, the creator of the model, and it is behaving much better. So my suspicion is now that there's something subtly wrong with mradermacher's GGUF quantization of this model, at least for https://huggingface.co/mradermacher/Llama-3-Lumimaid-70B-v0.1-alt-i1-GGUF/blob/main/Llama-3-Lumimaid-70B-v0.1-alt.i1-Q5_K_S.gguf
The obvious difference is that these are weighted (imatrix) quants, not static ones. However, given the randomness of LLM output, one would really need a more objective way of testing than subjective impressions. And most importantly, it makes little sense to compare different quantizations and expect them to behave the same, especially if the difference is only "subtle".
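For example, scoring both files on the same text and comparing perplexity would be more meaningful than impressions. A rough sketch using llama-cpp-python, written from memory (check that your version exposes `Llama.scores` when constructed with `logits_all=True`; `sample.txt` is whatever fixed evaluation text you choose):

```python
# Rough sketch: compare two GGUF quantizations by perplexity on identical
# text, which removes sampling randomness and subjective judgement. A truly
# broken quant typically shows a dramatically higher perplexity, not a
# subtly higher one.
import math
import numpy as np
from llama_cpp import Llama

def perplexity(model_path: str, text: str, n_ctx: int = 4096) -> float:
    llm = Llama(model_path=model_path, n_ctx=n_ctx,
                logits_all=True, verbose=False)
    tokens = llm.tokenize(text.encode("utf-8"))[:n_ctx]
    llm.eval(tokens)
    logits = np.asarray(llm.scores[: len(tokens)])  # (n_tokens, n_vocab)
    # log-softmax each row, then take the log-probability each position
    # assigned to the token that actually followed it
    logprobs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    nll = -np.mean([logprobs[i, tokens[i + 1]]
                    for i in range(len(tokens) - 1)])
    return math.exp(nll)

text = open("sample.txt").read()
for path in ("Llama-3-Lumimaid-70B-v0.1-alt.i1-Q5_K_S.gguf",
             "Llama-3-Lumimaid-70B-v0.1-alt.q4_k_m.gguf"):
    print(path, round(perplexity(path, text), 3))
```

The perplexity example that ships with llama.cpp does the same thing with proper batching, if you'd rather not script it.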