Loading the model 26gb? #7
by MooCow27 - opened
I was trying to load the model to integrate it with llama index, but does running it really use 26 GB of VRAM? Is there a way to reduce that?
Thanks!
The model would likely need to be quantized to use less memory. You could probably load it as-is with the --load-in-8bit flag when using text-generation-webui. (The 8-bit feature is provided by the bitsandbytes Python dependency.)
To take it down further, it could be quantized to 4 bits. There's another discussion thread here that talks about that.
For 8-bit, you can run the model in its current form (see the sketch below for a plain-transformers equivalent). For 4-bit, you'll have to run a quantization step yourself, which takes a while, but is totally doable on a local machine.
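Since you mentioned llama index, you may be loading through transformers rather than the webui. Here's a rough sketch of the equivalent 8-bit path there; "your-model-id" is a placeholder, and it assumes you have bitsandbytes and accelerate installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-id"  # placeholder for this repo's model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes 8-bit quantization of the weights
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)
```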
I bet this model was released as FP32 instead of FP16.
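If that's the case, just loading it in half precision should roughly halve the memory even before any quantization. Something like this (model id is a placeholder again):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-model-id",            # placeholder
    torch_dtype=torch.float16,  # cast FP32 weights to FP16 on load
    device_map="auto",
)
```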