Quantized version being loaded in TGI and consuming way too much memory?
Hi there - I presume I must be doing something wrong, so hoping you can help me figure out what I've overlooked.
I'm running this model with the latest TGI docker container, with the following params:
docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token --pull always ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ--quantize gptq
I can see it's pulled the model to disk, and the config JSON shows bits: 4 under quantization_config.
My expectation was that it would take roughly 10GB of VRAM, but when I load it, nvidia-smi shows TGI has consumed 46.74GiB out of my 48GB total (A6000).
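(My rough math, in case I'm off base there: 13B parameters × 4 bits ≈ 6.5GB for the weights alone, plus GPTQ group metadata and some runtime overhead, which is how I landed on roughly 10GB.)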
Have I missed something here?
oops - that was actually an issue from copy/pasting the model name into my example - when running it, there is a proper space. I've noticed that when TGI loads, it does initially consume about 10-12 GB of VRAM, but then within 1-2 seconds of TGI warming the model, it jumps up to 46GB. The same thing happens when I test against TheBloke/Phind-CodeLlama-34B-v2-GPTQ.
There isn't some special way to load the custom branches for the other quant methods that I'm overlooking here, is there?
Oh, that's just normal TGI behaviour then. It pre-allocates all the VRAM it can for the KV cache during warmup. Just test it; it should be fine.
OH - I wasn't aware it did this. Is that configurable? I've posted this separately on the TGI GitHub, so if you don't know, I'll just wait for their response. But I was hoping to take advantage of this 48GB card by loading two quantized models in two separate instances of TGI, so that I could serve two models from the same card (in this case, the Llama-2-13B-chat model as well as the Phind-CodeLlama model). I (perhaps incorrectly?) presumed that running two instances with properly quantized models would permit this.
Just found this in the TGI GitHub, now that I have more to search on:
--cuda-memory-fraction is the way to control total RAM usage if you want to stack multiple deployments on the same machine.
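So presumably something like this would let me stack both models on the one card (untested sketch - the second host port and the 50/50 split are just my guesses, and I'd probably tune the fractions per model):

docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ --quantize gptq --cuda-memory-fraction 0.5

docker run --gpus all --shm-size 1g -p 8081:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Phind-CodeLlama-34B-v2-GPTQ --quantize gptq --cuda-memory-fraction 0.5

with the understanding that each instance then only gets its share of the card for KV cache, so max batch size / context will shrink accordingly.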
But I suppose a still-relevant question - if I wanted to use an 8-bit quant instead of the 4-bit quant of your Llama-2-13B-chat-GPTQ model, how would I do that with TGI? I can't seem to find an obvious answer to how I would use a different branch from your repo with TGI.
It's the revision parameter, eg --revision gptq-8bit-128g-actorder_True
or whatever branch name
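e.g., adapting your original command (untested, but this is the general shape):

docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data -e HUGGING_FACE_HUB_TOKEN=$token ghcr.io/huggingface/text-generation-inference:latest --model-id TheBloke/Llama-2-13B-chat-GPTQ --revision gptq-8bit-128g-actorder_True --quantize gptq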
Got it - I thought revision only accepted a commit hash. Thanks again!