Quantized version being loaded in TGI and consuming way too much memory?
Hi there - I presume I must be doing something wrong, so hoping you can help me figure out what I've overlooked.
I'm running this model with the latest TGI docker container, with the following params:
docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token --pull always ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ--quantize gptq
I can see it's pulled the model to disk, and the config JSON shows bits: 4 under quantization_config.
My expectation was that it would take roughly 10GB of VRAM, but when I load it, nvidia-smi shows TGI has consumed 46.74GiB out of my 48GB total (A6000).
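(My rough math, in case I'm off base there: 13B parameters × 4 bits ≈ 6.5GB for the weights alone, plus GPTQ group metadata and some runtime overhead, which is how I landed on roughly 10GB.)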
Have I missed something here?
oops - that was actually an issue from copy/pasting the model name into my example - when running it, there is a proper space. I've noticed that when TGI loads, it does initially consume about 10-12 GB of VRAM, but then within 1-2 seconds of TGI warming the model, it jumps up to 46GB. The same thing happens when I test against TheBloke/Phind-CodeLlama-34B-v2-GPTQ.
There isn't some special way to load the custom branches for the other quant methods that I'm overlooking here, is there?
Oh, that's just normal TGI behaviour then. It pre-allocates all the VRAM it can for the KV cache during warmup. Just test it; it should be fine.
OH - I wasn't aware it did this. Is that configurable? I've posted this separately on the TGI GitHub, so if you don't know, I'll just wait for their response. But I was hoping to take advantage of this 48GB card by loading two quantized models in two separate instances of TGI, so that I could serve two models from the same card (in this case, the Llama-2-13B-chat model as well as the Phind-CodeLlama model). I (perhaps incorrectly?) presumed that running two instances with properly quantized models would permit this.
Just found this in the TGI GitHub, now that I have more to search on:
--cuda-memory-fraction is the way to control total RAM usage if you want to stack multiple deployments on the same machine.
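So presumably something like this would let me stack both models on the one card (untested sketch - the second host port and the 50/50 split are just my guesses, and I'd probably tune the fractions per model):

docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ --quantize gptq --cuda-memory-fraction 0.5

docker run --gpus all --shm-size 1g -p 8081:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Phind-CodeLlama-34B-v2-GPTQ --quantize gptq --cuda-memory-fraction 0.5

with the understanding that each instance then only gets its share of the card for KV cache, so max batch size / context will shrink accordingly.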
But I suppose a still-relevant question - if I wanted to use an 8-bit quant instead of the 4-bit quant of your Llama-2-13B-chat-GPTQ model, how would I do that with TGI? I can't seem to find an obvious answer to how I would use a different branch from your repo with TGI.
It's the revision parameter, eg --revision gptq-8bit-128g-actorder_True
or whatever branch name
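e.g., adapting your original command (untested, but this is the general shape):

docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ --revision gptq-8bit-128g-actorder_True --quantize gptq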
Got it - I thought revision only accepted a commit hash. Thanks again!