
AI Model Name: Llama 3 70B "Built with Meta Llama 3" https://llama.meta.com/llama3/license/

How to quantize the 70B model so it fits on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM and they all failed for different reasons (issues opened).

HQQ worked:

I rented a 4x GPU instance with 1TB of RAM ($19/hr) on RunPod, with 1024GB of container disk and 1024GB of workspace disk. I think you only need 2x GPU with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.
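As a rough sanity check on these sizes, the sketch below estimates weight storage from the parameter count. It assumes about 70.6 billion parameters, and that each group of 64 weights carries one scale and one zero-point stored at 8 bits each; both figures are assumptions for illustration, not values read from the HQQ code. It ignores layers left unquantized, activations, and the KV cache, so the results are lower bounds.

```python
# Back-of-the-envelope weight footprint. All constants are assumptions.

N_PARAMS = 70.6e9  # approximate parameter count of a "70B" model (assumption)


def bits_per_weight(nbits, group_size, meta_bits=8):
    """Effective bits per weight: the code itself plus one scale and one
    zero-point per group, each stored at meta_bits (assumed layout)."""
    return nbits + 2 * meta_bits / group_size


def footprint_gb(n_params, bits):
    """Storage in decimal gigabytes for n_params weights at `bits` bits each."""
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    bf16_gb = footprint_gb(N_PARAMS, 16)
    q4_gb = footprint_gb(N_PARAMS, bits_per_weight(nbits=4, group_size=64))
    print(f"bf16 weights:  {bf16_gb:.1f} GB")
    print(f"4-bit weights: {q4_gb:.1f} GB (two 24 GB cards give 48 GB)")
```

Under these assumptions the bf16 checkpoint comes to roughly 141 GB, which is consistent with needing a large-RAM instance for the quantization step, while the 4-bit version is about 37.5 GB against the 48 GB available on two 4090s.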

Note that you need to fill in the access request form before you can download the 70B Meta weights.

Copy and paste this into the console to set everything up. The last command, huggingface-cli login, prompts for your access token:

apt update
apt install git-lfs vim -y

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc

conda create -n hqq python=3.10 -y && conda activate hqq

git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq

pip install torch
pip install .

pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli login
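Before starting the download, it may be worth confirming that the instance has enough RAM and disk. The sketch below is a hypothetical pre-flight check using only the standard library. The 512 GB RAM threshold comes from the estimate above; the 300 GB disk threshold is an assumption meant to cover the original checkpoint plus the quantized copy.

```python
import os
import shutil


def total_ram_gb():
    """Physical RAM in decimal GB, read via sysconf (Linux and most Unix)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def free_disk_gb(path="."):
    """Free space in decimal GB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def preflight(ram_gb, disk_gb, min_ram_gb=512.0, min_disk_gb=300.0):
    """Return a list of problems; an empty list means the checks passed.
    Both thresholds are assumptions and can be overridden."""
    problems = []
    if ram_gb < min_ram_gb:
        problems.append(f"RAM {ram_gb:.0f} GB is below {min_ram_gb:.0f} GB")
    if disk_gb < min_disk_gb:
        problems.append(f"free disk {disk_gb:.0f} GB is below {min_disk_gb:.0f} GB")
    return problems


if __name__ == "__main__":
    issues = preflight(total_ram_gb(), free_disk_gb("."))
    print("OK" if not issues else "\n".join(issues))
```

Run it from the directory where the weights will be stored, since free space is measured on that filesystem.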

Create the quantize.py file by pasting this into the console:

echo "
import torch

# Explicit imports instead of a wildcard import
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model_id      = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir      = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights in groups of 64, with quantization metadata offloaded
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

# Group size used for the scale and zero-point metadata
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size']  = zero_scale_group_size

# Load the original weights (gated repo, requires the login above)
model = HQQModelForCausalLM.from_pretrained(model_id)

# Quantize in place, then save the result to save_dir
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Reload the saved checkpoint as a sanity check
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
" > quantize.py

Run script:

python quantize.py
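Once the script finishes, a quick way to see whether the result is plausible for 2x 4090 is to measure the saved directory. The sketch below is a hypothetical helper: the 24 GB per card is the 4090's VRAM, while the 4 GB of headroom for activations and the KV cache is an assumption. VRAM is not pooled across cards, so this only checks the total and says nothing about how the model is split between GPUs.

```python
import os


def dir_size_gb(path):
    """Total size of all regular files under `path`, in decimal gigabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9


def fits_in_vram(size_gb, n_gpus=2, vram_per_gpu_gb=24.0, headroom_gb=4.0):
    """True if size_gb plus headroom_gb fits in the combined VRAM.
    headroom_gb is an assumed allowance, not a measured value."""
    return size_gb + headroom_gb <= n_gpus * vram_per_gpu_gb


if __name__ == "__main__":
    save_dir = "cat-llama-3-70b-hqq"  # same save_dir as in quantize.py
    if os.path.isdir(save_dir):
        size = dir_size_gb(save_dir)
        print(f"{save_dir}: {size:.1f} GB, fits on 2x 24 GB: {fits_in_vram(size)}")
    else:
        print(f"{save_dir} not found; run quantize.py first")
```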