GGML v2 Request
Hi,
I appreciate that you've converted these models to GGML v3 and uploaded them for us, just as you did with v2.
However, llama.cpp is roughly 2 seconds per token slower on v3 for me, which I didn't expect.
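For context, here's a rough sketch of the kind of per-token timing I mean, using llama-cpp-python purely as an illustration (the prompt and token count are arbitrary placeholders, and a given llama.cpp build generally loads only one GGML file version, so each file has to be timed under a build that supports it):

```python
import time
from llama_cpp import Llama

def tokens_per_second(model_path: str, prompt: str = "Hello,", n: int = 64) -> float:
    """Generate up to n tokens and return a rough tokens/sec figure."""
    llm = Llama(model_path=model_path, verbose=False)
    start = time.time()
    out = llm(prompt, max_tokens=n)
    elapsed = time.time() - start
    # Use the actual number of generated tokens, since generation can stop early.
    return out["usage"]["completion_tokens"] / elapsed

print(tokens_per_second("Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin"))
```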
I'm requesting an upload of, or a link to, Wizard-Vicuna-7B-Uncensored.ggmlv2.q4_0.bin, if possible.
Thanks for your consideration.
Oh, that's very surprising. Which quant type?
I'll look into making GGML v2s this evening.
Thanks for your response.
I use q4_0 and found that the v3 file of the same type is slower. I was surprised too, so I made a post on the llama.cpp GitHub raising the issue.
(I'm guessing "quant type" means q4_0; please correct me if I'm not referring to the right thing.)
:)
Oh, I just realised which model you were writing about. GGML v2s are still available; check the previous_llama_ggmlv2 branch.
Yes, quant type means q4_0 etc. How odd; I thought v3 was meant to be quicker.
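If it helps, here's a minimal sketch of pulling the v2 file from that branch with huggingface_hub (the revision parameter selects the branch; the filename follows the one you asked about, but check the branch listing for the exact name):

```python
from huggingface_hub import hf_hub_download

# revision= selects the previous_llama_ggmlv2 branch rather than main.
path = hf_hub_download(
    repo_id="TheBloke/Wizard-Vicuna-7B-Uncensored-GGML",
    filename="Wizard-Vicuna-7B-Uncensored.ggmlv2.q4_0.bin",
    revision="previous_llama_ggmlv2",
)
print(path)
```

(A plain git clone of the repo with that branch checked out works too.)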
Oh, there they are! (https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/tree/previous_llama_ggmlv2) Thanks for pointing that out for me.
Yeah, I can see the RAM requirements for the v3 q4_0 model have decreased, but my timings with v2 are better.
I hope to get both benefits, y'know? :)
Thanks again!