Anthonyg5005's picture
potential support for blackwell
c7342b4
For NVIDIA cards install the CUDA toolkit
Recommended for all exllama features:
https://developer.nvidia.com/cuda-12-4-0-download-archive
Required for Blackwell/Geforce 5000 nvidia cards (nightly torch support):
https://developer.nvidia.com/cuda-12-8-0-download-archive
Only for compatibily, doesn't support all features:
https://developer.nvidia.com/cuda-11-8-0-download-archive
Paths should overall lead to the version of cuda you're going to install with
Paths examples with 12.4:
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
CUDA_PATH_V12_4: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4
(FOR WINDOWS ONLY) Visual Studio with desktop development for C++ is required. 2019 works fine for me.
https://visualstudio.microsoft.com/thank-you-downloading-visual-studio/?sku=community&rel=16&utm_medium=microsoft&utm_campaign=download+from+relnotes&utm_content=vs2019ga+button
install the desktop development for C++ workload
you might need cl.exe in path, for me it looks like this: "c:\program files (x86)\microsoft visual studio\2019\community\vc\tools\msvc\14.29.30133\bin\hostx64\x64"
Make sure to have git and wget installed on your system.
For Linux, you need gcc so you'll need to install the build tools from your package manager.
For example, on Ubuntu use: sudo apt-get install build-essential
This may work with AMD cards but only on linux and possibly WSL2. I can't guarantee that it will work on AMD cards, I personally don't have one to test with. You may need to install stuff before starting. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
Only python 3.8 - 3.12 is known to work. If you have a higher/lower version of python, I can't guarantee that it will work.
First setup your environment by using either windows.bat or linux.sh. If you want to upgrade your exllama version, you can run the setup script again.
After setup is complete then you'll have a file called start-quant. Use this to run the quant script.
Make sure that your storage space is 3x the amount of the model's size. To mesure this, take the number of billion parameters and mutliply by two, afterwards mutliply by 3 and that's the recommended storage. There's a chance you may get away with 2.5x the size as well.
Make sure to also have a lot of RAM depending on the model. Have noticed gemma to use a lot. If the quantizing crashes without a cuda out of memory error then it most likely ran out of system memory.
If you close the terminal or the terminal crashes, check the last BPW it was on and enter the remaining quants you wanted. It should be able to pick up where it left off. Don't type the BPW of completed quants as it will start from the beginning. You may also use ctrl + c to pause at any time during the quant process.
To add more options to the quantization process, you can add them to line 206. All options: https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
Things may break in the future as it downloads the latest version of all the dependencies which may either change names or how they work. If something breaks, try redownloading the latest version, and if that doesn't work then please open a discussion at https://huggingface.co/Anthonyg5005/hf-scripts/discussions
for faster/easier support, join the exllama discord server and go to community-projects where you can find "easy exl2 quantizing" https://discord.gg/NSFwVuCjRq
Credit to turboderp for creating exllamav2 and the exl2 quantization method.
https://github.com/turboderp
Credit to oobabooga the original download and safetensors scripts.
https://github.com/oobabooga
Credit to Lucain Pouget for maintaining huggingface-hub.
https://github.com/Wauplin