VRAM requirements

#1
by sophosympatheia - opened

Hey, Wolfram. How are you squeezing the 3.0 bpw weights into 48 GB of VRAM? I tried to load it, even at 4K context, and I hit OOM using Textgen WebUI.

Try setting PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync on TabbyAPI. It saves about 1 GB of VRAM. I shared the trick on Reddit and someone found a way to use it in ooba, but I don't use ooba, so I don't know how.
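For anyone launching a backend from a script, here is a minimal Python sketch of the same setting. The variable name and value come from the post above; the helper name `env_with_async_allocator` is made up for illustration. As far as I understand, PyTorch reads this variable when it first sets up its CUDA allocator, so it has to be in the environment before the backend process starts.

```python
import os

ALLOC_VAR = "PYTORCH_CUDA_ALLOC_CONF"
ALLOC_VALUE = "backend:cudaMallocAsync"  # value quoted in the post above


def env_with_async_allocator(base_env=None):
    """Return a copy of the environment with the async CUDA allocator selected.

    Hypothetical helper: it copies the environment instead of mutating it,
    so the calling process is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ALLOC_VAR] = ALLOC_VALUE
    return env


if __name__ == "__main__":
    env = env_with_async_allocator()
    print(f"{ALLOC_VAR}={env[ALLOC_VAR]}")
```

The returned dict can be passed as `env=` to `subprocess.run(...)` together with whatever command normally starts your backend.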

I split the GPUs with gpu_split: [21.7,24].
Somehow this model uses more VRAM than 3bpw Goliath. I can run 3bpw Goliath at 8192 context with the split tuned extremely precisely (down to 0.0*), but I can't fit 6144 with this one.

At 4096 I got this usage:

|   0  NVIDIA GeForce RTX 3090      WDDM  | 00000000:03:00.0 Off |                  N/A |
|  0%   19C    P8               8W / 329W |  23487MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090      WDDM  | 00000000:04:00.0 Off |                  N/A |
| 30%   15C    P8               9W / 311W |  24031MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
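To make the margin in that output explicit, here is a small Python sketch that works out the free memory per card. The used and total figures are copied from the nvidia-smi lines above; the function name `headroom` is just for illustration.

```python
TOTAL_MIB = 24576  # per-card total reported by nvidia-smi above

# Used memory per GPU index, copied from the output above (MiB).
used_mib = {0: 23487, 1: 24031}


def headroom(used, total=TOTAL_MIB):
    """Return the free MiB per GPU, given used MiB per GPU."""
    return {gpu: total - u for gpu, u in used.items()}


free = headroom(used_mib)
for gpu, mib in free.items():
    print(f"GPU {gpu}: {mib} MiB free")

combined_used = sum(used_mib.values())
combined_total = TOTAL_MIB * len(used_mib)
print(f"combined: {combined_used} / {combined_total} MiB used")
```

That works out to 1089 MiB free on GPU 0 and 545 MiB free on GPU 1 at 4096 context, which is why an allocator change worth about 1 GB matters here.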

@sophosympatheia I can fit 6K into 48 GB VRAM on Linux. Are you on Windows? Is anything else taking up VRAM? It's a tight fit: it's using 48305MiB / 49140MiB VRAM right now for wolfram_miquliz-120b-v2.0-3.0bpw-h6-exl2 with 6K context.

@akoyaki Thanks, found the Reddit post you mentioned. Sometimes that little bit of saved VRAM makes all the difference.

@wolfram Wow, I commented under that post, but it's not mine lol. I mentioned the trick in another, earlier post. I found where I said it, worded differently but basically the same: https://www.reddit.com/r/LocalLLaMA/comments/194zwyc/comment/khroisl/?utm_source=share&utm_medium=web2x&context=3

Also, if you're on Linux, ensure you're on a recent CUDA and Nvidia driver. With older CUDA/Nvidia drivers, a 3bpw quant won't fit on a dual 3090 server.

Thanks for the advice, everyone! I run my LLM setup on Ubuntu running within WSL, so not optimal for squeezing every last drop of VRAM. I'm up to date on CUDA/NVIDIA drivers. I'll try the PYTORCH_CUDA_ALLOC_CONF setting and see if that gets me over the finish line. Thanks again!

No idea how you guys did it, but I couldn't get 3.0bpw to work in 48 GB on Linux with Python exllama and nothing else running. I managed to get it working with 2.9, though.
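A rough back-of-the-envelope sketch of why 2.9 bpw can fit where 3.0 bpw does not. It assumes about 120e9 parameters, taken only from the "120b" in the model name (the real count may differ), and it treats bpw as a flat average over all weights, ignoring cache, activations, and allocator overhead.

```python
MIB = 1024 ** 2

# ASSUMPTION: parameter count inferred from the "120b" in the model name.
ASSUMED_PARAMS = 120e9


def weight_mib(bpw, n_params=ASSUMED_PARAMS):
    """Approximate weight size in MiB at a given average bits per weight."""
    return n_params * bpw / 8 / MIB


for bpw in (3.0, 2.9):
    print(f"{bpw} bpw: ~{weight_mib(bpw):,.0f} MiB of weights")

saved = weight_mib(3.0) - weight_mib(2.9)
print(f"difference: ~{saved:,.0f} MiB")
```

Under those assumptions the 0.1 bpw step frees roughly 1.4 GiB, which is larger than the few hundred MiB of headroom reported in this thread.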

I managed to load this on Windows at 3.0bpw on oobabooga using the set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync trick, with gpu_split: [22.2,24], 6K context and 8-bit cache. Dual 4090.
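On why the 8-bit cache helps: the KV cache grows linearly with context length and with bytes per element, so going from 16-bit to 8-bit halves it. The sketch below uses the usual estimate (keys plus values, per layer, per KV head). The layer count, KV head count, and head dimension are hypothetical placeholders, not this model's real config, so read the output as an illustration of the scaling only.

```python
MIB = 1024 ** 2


def kv_cache_mib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size in MiB.

    The factor of 2 accounts for storing both keys and values.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len
    return elems * bytes_per_elem / MIB


# HYPOTHETICAL architecture values, chosen only for illustration.
cfg = dict(n_layers=140, n_kv_heads=8, head_dim=128)

fp16 = kv_cache_mib(6144, bytes_per_elem=2, **cfg)
fp8 = kv_cache_mib(6144, bytes_per_elem=1, **cfg)
print(f"16-bit cache at 6K: ~{fp16:.0f} MiB")
print(f" 8-bit cache at 6K: ~{fp8:.0f} MiB")
```

Whatever the real numbers are, the ratio holds: the 8-bit cache takes half the space of the 16-bit one at the same context length.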

I'm using 2x 4090, running on Windows, with the monitor plugged into one of the 4090s. GPU split: [22.2,24].

I can run the set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync command on the CLI after running cmd_windows (which comes with oobabooga), but that only allowed me to run 3.0bpw at 6000 context with 8-bit cache.

If I manually edit the start_windows.bat file, add set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync under set "CUDA_HOME=%CUDA_PATH%" in the "@rem environment isolation" section, and then launch with that same bat file, I can actually load 3.0bpw at 8192 context.

I used the SillyTavern front end with the built-in Simple-1 preset (I found it better than Deterministic or Divine Intellect), using the Mistral context and instruct templates. Getting around 12-18 t/s.

Just curious, is 3.0bpw worth running in terms of quality, compared to something like Midnight Miqu 70B running at 4.5bpw with 32k context? What are your opinions? I've tested MiquLiz and it seemed really good until it reached around 4k context, where it started spitting out gibberish, but that could just be my ST settings.
