---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---

## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ

This is a version of the Mixtral-8x7B-Instruct-v0.1 model quantized with a mix of 4-bit and 2-bit weights via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit. The difference from our previous release is that this one offloads the quantization metadata to the CPU, so you only need 13 GB of VRAM to run it instead of 20 GB!

*Note*: this model was updated to use a group-size of 128 instead of 256 for the scale/zero parameters, which slightly improves the overall score with a negligible increase in VRAM.

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)

----------------------------------------------------------------------------------------------------------------------------------
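For context, a mixed 4-bit/2-bit HQQ setup of this kind is expressed through per-layer quantization configs. The sketch below is only illustrative: the module names follow the Hugging Face Mixtral implementation, and the group sizes and flags shown are assumptions rather than the exact recipe used to produce this checkpoint.

``` Python
#Illustrative sketch of a mixed 4-bit/2-bit HQQ config with offloaded metadata.
#The values below are assumptions; the exact settings for this checkpoint may differ.
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig

model_id = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

#4-bit config for the attention projections, 2-bit config for the expert weights.
#offload_meta=True keeps the quantization metadata on the CPU.
attn_params    = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
#Note: the scale/zero parameters are themselves quantized with their own group-size
#(128 for this checkpoint, per the note above); the exact config keys depend on the hqq version.

#Map Mixtral linear layers to their quantization settings
quant_config = {}
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

#Quantize the full-precision model in place
model = HQQModelForCausalLM.from_pretrained(model_id)
model.quantize_model(quant_config=quant_config)
```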
## Performance

| Metric              | Mixtral Original | HQQ quantized |
|---------------------|------------------|---------------|
| Runtime VRAM        | 94 GB            | 13.5 GB       |
| ARC (25-shot)       | 70.22            | 66.55         |
| Hellaswag (10-shot) | 87.63            | 84.83         |
| MMLU (5-shot)       | 71.16            | 67.39         |
| TruthfulQA-MC2      | 64.58            | 62.80         |
| Winogrande (5-shot) | 81.37            | 80.03         |
| GSM8K (5-shot)      | 60.73            | 45.41         |
| Average             | 72.62            | 67.83         |

## Screencast

Here is a short screencast of the model running on an RTX 4090:

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/upGS5kOw_m-ry8WcMO9gJ.gif)

### Basic Usage

To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:

``` Python
import transformers
from threading import Thread

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'

#Load the quantized model and tokenizer
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)

#Optional: set backend/compile.
#You will need to install the CUDA kernels beforehand:
# git clone https://github.com/mobiusml/hqq/
# cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    #Build the generation kwargs; the prompt follows the Mixtral-Instruct "[INST] ... [/INST]" template
    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
    )

    #Run generation in a background thread and stream the decoded text
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs
```
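With `chat_processor` defined, streaming generation is a single call; the prompt below is just an illustrative example.

``` Python
#Example call: streams the response to stdout and returns the generated chunks
outputs = chat_processor("How do I build a car?", max_new_tokens=500, do_sample=False)
```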