REALLY slow with flash attention and quantized cache.

#2
opened by Olafangensan

Using Q6_K_L:
I get 10-11 T/s with Q4 cache, and 34 T/s without.
That said, 16k tokens take up 5 GB without FA, for a total of 22 GB of VRAM, and the full 32k easily OOMs my 3090. Is this a GGUF issue, or is it on the model's side?
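For rough context on where that memory goes: the KV cache grows linearly with context length, so the numbers can be sanity-checked with a quick sketch. The layer/head counts below are hypothetical placeholders, not confirmed Reka Flash 3 values:

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers used
# below are hypothetical placeholders, NOT confirmed Reka Flash 3 values.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each hold n_ctx vectors of n_kv_heads * head_dim per layer.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# f16 = 2 bytes/elem; ggml q8_0 = 34 bytes per 32-value block (~1.06);
# ggml q4_0 = 18 bytes per 32-value block (~0.56).
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32), ("q4_0", 18 / 32)]:
    gib = kv_cache_bytes(48, 8, 128, 16384, bpe) / 2**30
    print(f"{name}: {gib:.2f} GiB at 16k context")
```

Whatever the real per-token footprint is, anything beyond the cache itself (like the 22 GB total here) is weights plus compute buffers, and the non-FA attention path needs noticeably bigger compute buffers.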

Strange, do you observe this with other models?

I would expect the Q4 quantization to slow it down a bit (I think), since it has to quantize the cache on the fly, but I still wouldn't have expected that huge of a hit 🤔
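For a sense of what "quantize on the fly" means per appended K/V vector, here is a minimal numpy sketch of ggml-style q4_0 block quantization. It's simplified (the real kernels are fused and vectorized), but the shape of the work is the same:

```python
import numpy as np

# Sketch of ggml-style q4_0 block quantization: 32 values become one
# fp16 scale plus 32 4-bit codes (18 bytes instead of 64 for f16).
def quantize_q4_0(block: np.ndarray):
    assert block.size == 32
    # The scale is chosen so the value with the largest magnitude maps to -8.
    extreme = block[np.abs(block).argmax()]
    d = extreme / -8.0
    inv_d = 1.0 / d if d != 0 else 0.0
    q = np.clip(np.round(block * inv_d) + 8, 0, 15).astype(np.uint8)
    return np.float16(d), q
```

Each new token pays some version of this on the write side for its K and V rows in every layer, and the attention kernel pays a dequantization (or quantized dot-product) cost on the read side.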

Not to that extent, no. In comparison, Mistral 24B (Q5_K_L) at 32k with 8-bit cache is about 19.5 GB on a Windows 11 machine, with 0.7 GB being just for Windows.
36.91 T/s

Mistral 24B Thinker v1.1 by Undi95 (Q6_K), 8-bit cache, 32k, 21.7 GB (0.7 for Windows).
27.50 T/s

Not tested properly, just throwing the same prompt at them in the Kobold Lite UI.

> Using Q6_K_L:
> I get 10-11 T/s with Q4 cache, and 34 T/s without.
> That said, 16k tokens take up 5 GB without FA, for a total of 22 GB of VRAM, and the full 32k easily OOMs my 3090. Is this a GGUF issue, or is it on the model's side?

Lol, I wish I had at least 10-11 T/s and here you are complaining...

In fairness, Q8 may be easier to do on the fly. Can you try Q4 with those models?
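For comparison with the q4_0 sketch above, q8_0 is a noticeably simpler transform: full int8 range and no 4-bit packing. A hedged sketch of the ggml-style block format, not the actual kernel:

```python
import numpy as np

# Sketch of ggml-style q8_0 block quantization: 32 values become one
# fp16 scale plus 32 int8 codes (34 bytes per block). No nibble packing
# and a 255-level range, so it is cheaper and gentler than q4_0.
def quantize_q8_0(block: np.ndarray):
    assert block.size == 32
    amax = np.abs(block).max()
    d = amax / 127.0
    q = (np.round(block / d).astype(np.int8) if d != 0
         else np.zeros(32, dtype=np.int8))
    return np.float16(d), q
```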

0.6-0.7 GB for Windows

Mistral 24B (Q5_K_L) at 32k with 4-bit cache - 17.9 GB, 38.84 T/s

Mistral 24B ArliAI RPMax v1.4 (Q6_K_L) at 32k with 4-bit cache - 20.5 GB, 33.98 T/s

Reka Flash 3 (Q6_K_L) at 32k with 8-bit cache - 20.1 GB, 11.5 T/s

If it's of any consolation, it's not just you. I can run Mistral 24B without flash attention, but not Reka Flash 3. This model IS quite demanding, which is kind of a shame because it's a smaller model than Mistral 24B, for crying out loud. But there are thousands of creators and thousands of different model architectures that all behave differently and have different requirements, so trying out a new model is always kind of like opening a mystery box: you never know what you'll get.
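One hedged guess at why flash attention matters more for some architectures: without FA, the attention scores for a batch get materialized as a full tensor in the compute buffer, and that tensor scales with head count and context length. The numbers below are made up for illustration, not Reka Flash 3's actual config, and llama.cpp's real buffer layout differs:

```python
# Rough size of the materialized attention-score tensor without flash
# attention. Head count and batch size are hypothetical placeholders;
# with FA the scores are computed in tiles and never held at full size.
def score_buffer_bytes(n_heads, n_ctx, n_batch, fp_bytes=4):
    return n_heads * n_ctx * n_batch * fp_bytes

print(score_buffer_bytes(n_heads=40, n_ctx=32768, n_batch=512) / 2**30, "GiB")
# -> 2.5 GiB, on top of the KV cache itself
```

Under that assumption, a model with more attention heads blows past the same VRAM budget sooner with FA off, even at a smaller parameter count.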

Yeah, I saw the Llama architecture and thought "Oh neat, that means there's gonna be no problems at all!"

Oh well, back to QwQ it is for heavy THUNK.
