Can the quantized model be loaded on a GPU for faster inference?
Is there a way to load this model onto a GPU and benefit from the acceleration?
Yes.
How?
And can I use it with something like NVIDIA Triton?
Looks like the inference binary needs to be compiled with CUDA support for this - https://github.com/ggerganov/llama.cpp#blas-build
But maybe it's better to use a version of this model quantized for NVIDIA GPUs - something like starchat-alpha-GPTQ. I don't have an NVIDIA GPU, so I don't know whether such a version exists or how to create one.
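For reference, the BLAS-build section linked above describes a cuBLAS build along these lines (the exact flags and the model filename here are from that README at the time and may have changed since):

```shell
# Assumes the CUDA toolkit is already installed.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# Offload some of the layers to the GPU at inference time
# (the model path is a placeholder - substitute your own ggml file):
./main -m models/your-model.ggml.bin -ngl 32 -p "your prompt"
```

The `-ngl` / `--n-gpu-layers` option controls how many layers are offloaded to the GPU; without it the binary still runs on CPU even when built with cuBLAS.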
You can run this ggml model in llama.cpp with GPU support.
I don't think llama.cpp can run this model just yet - https://github.com/ggerganov/llama.cpp/issues/1441
For now there is only example code here - https://github.com/ggerganov/ggml/tree/master/examples/starcoder
This code works, but it isn't very useful: it loads the model, generates a reply to a single prompt, and shuts down. I'm now experimenting with this code to get a conversation loop going, but I'm having trouble with it - it looks like I haven't figured out how to manage memory correctly. It breaks after a single iteration of the loop with "not enough memory in context". We'll see if I can do better.
Also related - https://github.com/LostRuins/koboldcpp/issues/181
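For what it's worth, the ggml examples allocate all tensors out of a single fixed-size memory pool, so a loop that keeps allocating into the same context without resetting it will exhaust the pool after one pass. The sketch below is a generic illustration of that failure mode and the fix, not the real ggml API (the `arena_t` type and function names are made up for the example):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size arena, mimicking how the ggml examples carve
   tensors out of one preallocated buffer (not the actual ggml API). */
typedef struct {
    unsigned char buf[1024];
    size_t used;
} arena_t;

/* Returns NULL when the pool is exhausted - the analogue of
   "not enough memory in context". */
static void *arena_alloc(arena_t *a, size_t n) {
    if (a->used + n > sizeof(a->buf)) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Resetting between iterations is what keeps a conversation loop alive:
   per-iteration scratch allocations are discarded before the next turn. */
static void arena_reset(arena_t *a) {
    a->used = 0;
}
```

In ggml terms this corresponds to freeing (or re-initializing) the per-evaluation context each turn instead of accumulating allocations in it, while keeping long-lived state (weights, KV cache) in a separate pool.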