Instructions for running on runpod.io
by simsim314
I successfully managed to run the 2.75bpw branch on 64GB of VRAM with 4× RTX A4000 GPUs (16GB per GPU).
Here are some key points:
- The template I'm using:
runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
- Download and install exllamav2 (inside Jupyter):
!git clone https://github.com/turboderp/exllamav2
%cd exllamav2
# Optionally, create and activate a new conda environment
!pip install -r requirements.txt
!pip install .
!pip install huggingface_hub
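Before downloading the model it can be worth sanity-checking the install in a quick cell. A minimal sketch (assumes the template above, where torch ships with CUDA support):

# Confirm exllamav2 imports and all four A4000s are visible to CUDA
import torch
import exllamav2

print(torch.__version__, torch.version.cuda)
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")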
- Download the model:
!huggingface-cli download turboderp/dbrx-instruct-exl2 --revision "2.75bpw" --local-dir dbrx_275 --exclude "*.safetensors"
%cd dbrx_275
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00001-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00002-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00003-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00004-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00005-of-00006.safetensors"
!wget "https://huggingface.co/turboderp/dbrx-instruct-exl2/resolve/2.75bpw/output-00006-of-00006.safetensors"
%cd ..
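wget can leave truncated files behind on a flaky connection, so it's worth confirming that all six shards arrived intact before loading. A minimal check, assuming the dbrx_275 layout above:

# Verify the six safetensors shards are present and report their sizes
from pathlib import Path

shards = sorted(Path("dbrx_275").glob("output-*-of-00006.safetensors"))
assert len(shards) == 6, f"expected 6 shards, found {len(shards)}"
for f in shards:
    print(f.name, f"{f.stat().st_size / 1024**3:.2f} GB")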
Note: I am not using huggingface-cli download for the safetensors files because runpod downloads them into the limited-space container disk (20GB) first.
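If you want to check how much room each mount actually has before choosing a download strategy, a quick cell like this works (the /workspace path is RunPod's usual volume mount; adjust if your pod differs):

# Compare free space on the container disk vs. the attached volume
import shutil

for path in ("/", "/workspace"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 1024**3:.1f} GB free of {usage.total / 1024**3:.1f} GB")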
- Run exllamav2 in a terminal (working directory: exllamav2):
python examples/chat.py -mode chatml -m dbrx_275 --gpu_split auto
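If you'd rather script generation than use the interactive chat example, something along these lines should work. This is a sketch based on exllamav2's bundled inference example, not code from this thread; lazy cache allocation plus load_autosplit is the API equivalent of --gpu_split auto, and max_seq_len is lowered here as an assumption to keep the cache from crowding out the weights on 64GB:

# Minimal scripted generation, mirroring chat.py with --gpu_split auto
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "dbrx_275"
config.prepare()
config.max_seq_len = 4096          # assumption: shrink the cache to fit alongside the weights

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # allocate cache as layers load
model.load_autosplit(cache)                # spread layers across the 4 GPUs automatically

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

# DBRX-instruct uses the ChatML prompt format (hence -mode chatml above)
prompt = "<|im_start|>user\nWhat is DBRX?<|im_end|>\n<|im_start|>assistant\n"
print(generator.generate_simple(prompt, settings, 200))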