Can the quantized model be loaded on a GPU for faster inference?
Is there a way to load this model onto a GPU and benefit from the acceleration?
Yes.
How?
And can I use it with something like NVIDIA Triton?
Looks like the inference binary needs to be compiled with CUDA support for this - https://github.com/ggerganov/llama.cpp#blas-build
But maybe it's better to use a version of this model quantized for NVIDIA GPUs - something like starchat-alpha-GPTQ. I don't have an NVIDIA GPU, so I don't know whether such a version exists or how to create one.
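For reference, the BLAS-build section linked above describes a cuBLAS build along these lines (the exact flags and the model filename here are from that README at the time and may have changed since):

```shell
# Assumes the CUDA toolkit is already installed.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# Offload some of the layers to the GPU at inference time
# (the model path is a placeholder - substitute your own ggml file):
./main -m models/your-model.ggml.bin -ngl 32 -p "your prompt"
```

The `-ngl` / `--n-gpu-layers` option controls how many layers are offloaded to the GPU; without it the binary still runs on CPU even when built with cuBLAS.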
You can run this ggml model in llama.cpp with GPU support.
I don't think llama.cpp can run this model just yet - https://github.com/ggerganov/llama.cpp/issues/1441
For now there is only example code here - https://github.com/ggerganov/ggml/tree/master/examples/starcoder
This code works, but it isn't very useful: it loads the model, generates a reply to a single prompt, and shuts down. I'm now experimenting with this code to get a conversation loop going, but I'm having trouble with it - it looks like I haven't figured out how to manage memory correctly. It breaks after a single iteration of the loop with "not enough memory in context". We'll see if I can do better.
Also related - https://github.com/LostRuins/koboldcpp/issues/181
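For what it's worth, the ggml examples allocate all tensors out of a single fixed-size memory pool, so a loop that keeps allocating into the same context without resetting it will exhaust the pool after one pass. The sketch below is a generic illustration of that failure mode and the fix, not the real ggml API (the `arena_t` type and function names are made up for the example):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size arena, mimicking how the ggml examples carve
   tensors out of one preallocated buffer (not the actual ggml API). */
typedef struct {
    unsigned char buf[1024];
    size_t used;
} arena_t;

/* Returns NULL when the pool is exhausted - the analogue of
   "not enough memory in context". */
static void *arena_alloc(arena_t *a, size_t n) {
    if (a->used + n > sizeof(a->buf)) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Resetting between iterations is what keeps a conversation loop alive:
   per-iteration scratch allocations are discarded before the next turn. */
static void arena_reset(arena_t *a) {
    a->used = 0;
}
```

In ggml terms this corresponds to freeing (or re-initializing) the per-evaluation context each turn instead of accumulating allocations in it, while keeping long-lived state (weights, KV cache) in a separate pool.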