---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---
## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
This is a version of the <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit precision via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
The difference between this model and <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ">our previous release</a> is that this one offloads the metadata to the CPU, so you only need 13 GB of VRAM to run it instead of 20 GB!
*Note*: this model was updated to use a group-size of 128 instead of 256 for the scale/zero parameters, which slightly improves the overall score with a negligible increase in VRAM.
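
As a rough sanity check on the VRAM figures above, here is a back-of-the-envelope sketch of where the memory goes. The layer shapes are taken from the public Mixtral-8x7B config, and the group sizes (64 for attention, 16 for experts) match the quantization settings listed further down. The storage model is an assumption: it supposes one 8-bit scale and one 8-bit zero-point per group, and the helper names are hypothetical.

```python
# Back-of-the-envelope estimate, not a measurement.
# Shapes from the public Mixtral-8x7B config.
HIDDEN, FFN, LAYERS, EXPERTS = 4096, 14336, 32, 8
KV_DIM = 1024  # 8 key/value heads x 128 dims (grouped-query attention)
GB = 1e9


def packed_gb(n_weights, nbits):
    """Size of the packed low-bit weights, in GB."""
    return n_weights * nbits / 8 / GB


def meta_gb(n_weights, group_size, meta_bits=8):
    """Size of the per-group metadata, in GB.

    ASSUMPTION: each group stores one scale and one zero-point,
    each using `meta_bits` bits.
    """
    n_groups = n_weights / group_size
    return n_groups * 2 * meta_bits / 8 / GB


# q_proj and o_proj are HIDDEN x HIDDEN, k_proj and v_proj are HIDDEN x KV_DIM
attn_weights = LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM)
# w1, w2 and w3 of every expert are HIDDEN x FFN (or its transpose)
expert_weights = LAYERS * EXPERTS * 3 * HIDDEN * FFN

gpu_gb = packed_gb(attn_weights, 4) + packed_gb(expert_weights, 2)
cpu_gb = meta_gb(attn_weights, 64) + meta_gb(expert_weights, 16)

print(f"packed weights kept on the GPU: {gpu_gb:.1f} GB")
print(f"metadata offloaded to the CPU:  {cpu_gb:.1f} GB")
```

Under these assumptions the packed weights come to about 12 GB and the metadata to a little under 6 GB, which is roughly in line with the drop from 20 GB to 13 GB. The estimate ignores the unquantized layers, activations and the KV cache.
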
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/-gwGOZHDb9l5VxLexIhkM.gif)
----------------------------------------------------------------------------------------------------------------------------------
## Performance
| Models | Mixtral Original | HQQ quantized |
|-------------------|------------------|------------------|
| Runtime VRAM | 94 GB | <b>13.5 GB</b> |
| ARC (25-shot) | 70.22 | 66.55 |
| Hellaswag (10-shot)| 87.63 | 84.83 |
| MMLU (5-shot) | 71.16 | 67.39 |
| TruthfulQA-MC2 | 64.58 | 62.80 |
| Winogrande (5-shot)| 81.37 | 80.03 |
| GSM8K (5-shot)| 60.73 | 45.41 |
| Average| 72.62 | 67.83 |
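
To see where the quantized model loses the most, the short sketch below recomputes the per-benchmark drop and the averages from the scores in the table. Only the numbers come from the table; the variable names are illustrative.

```python
# Scores copied from the table above: (Mixtral original, HQQ quantized)
scores = {
    "ARC (25-shot)": (70.22, 66.55),
    "Hellaswag (10-shot)": (87.63, 84.83),
    "MMLU (5-shot)": (71.16, 67.39),
    "TruthfulQA-MC2": (64.58, 62.80),
    "Winogrande (5-shot)": (81.37, 80.03),
    "GSM8K (5-shot)": (60.73, 45.41),
}

# Absolute drop in score for each benchmark
drops = {name: orig - quant for name, (orig, quant) in scores.items()}
largest = max(drops, key=drops.get)

avg_orig = sum(orig for orig, _ in scores.values()) / len(scores)
avg_hqq = sum(quant for _, quant in scores.values()) / len(scores)

for name, drop in drops.items():
    print(f"{name:<20} -{drop:.2f}")
print(f"largest drop: {largest}")
print(f"averages: {avg_orig:.2f} (original) vs {avg_hqq:.2f} (HQQ)")
```

GSM8K accounts for most of the gap (about 15 points), while every other benchmark drops by less than 4 points.
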
## Screencast
Here is a short screencast of the model running on an RTX 4090:
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/636b945ef575d3705149e982/upGS5kOw_m-ry8WcMO9gJ.gif)
### Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
``` Python
import transformers
from threading import Thread
model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'
#Load the model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)
#Optional: set the backend
#This requires installing the CUDA kernels beforehand:
#    git clone https://github.com/mobiusml/hqq/
#    cd hqq/kernels && python setup_cuda.py install
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)
def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
    generate_params = dict(
        tokenizer("<s> [INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature=0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )
    #Run generation in a background thread so the streamer can be consumed here
    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()
    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)
    return outputs
################################################################################################
#Generation
outputs = chat_processor("How do I build a car?", max_new_tokens=1000, do_sample=False)
```
### Quantization
You can reproduce the model using the following quant configs:
``` Python
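import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
hf_auth = None     #Placeholder: your Hugging Face access token, if one is needed
cache_path = None  #Placeholder: download cache directory (None uses the default)
model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)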
#Quantize params
from hqq.core.quantize import *
attn_prams = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
zero_scale_group_size = 128
attn_prams['scale_quant_params']['group_size'] = zero_scale_group_size
attn_prams['zero_quant_params']['group_size'] = zero_scale_group_size
experts_params['scale_quant_params']['group_size'] = zero_scale_group_size
experts_params['zero_quant_params']['group_size'] = zero_scale_group_size
quant_config = {}
#Attention
quant_config['self_attn.q_proj'] = attn_prams
quant_config['self_attn.k_proj'] = attn_prams
quant_config['self_attn.v_proj'] = attn_prams
quant_config['self_attn.o_proj'] = attn_prams
#Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params
#Quantize
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
The full example is available on GitHub at https://github.com/mobiusml/hqq/blob/master/examples/hf/mixtral_13GB_example.py
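
To make the `nbits` and `group_size` settings above more concrete, here is a minimal sketch of plain group-wise affine quantization with round-to-nearest. It is not HQQ's algorithm, which additionally optimizes the quantization parameters, and the function names are hypothetical. It only illustrates that each group of weights gets its own scale and zero-point, so a smaller group size lowers the error at the cost of more metadata.

```python
def quantize_group(values, nbits):
    """Round-to-nearest affine quantization of one group of floats."""
    qmax = 2 ** nbits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = -lo / scale
    codes = [min(qmax, max(0, round(v / scale + zero))) for v in values]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float values."""
    return [(c - zero) * scale for c in codes]


def quantize(weights, nbits, group_size):
    """Split the weights into groups and quantize each one independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, nbits) for g in groups]


def max_error(weights, nbits, group_size):
    """Largest absolute reconstruction error over all weights."""
    restored = []
    for codes, scale, zero in quantize(weights, nbits, group_size):
        restored.extend(dequantize_group(codes, scale, zero))
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [i / 10 for i in range(-8, 8)]  # 16 toy weights
err_16 = max_error(weights, nbits=2, group_size=16)  # 1 group, 1 (scale, zero) pair
err_4 = max_error(weights, nbits=2, group_size=4)    # 4 groups, 4 (scale, zero) pairs
print(f"max error with group_size=16: {err_16:.3f}")
print(f"max error with group_size=4:  {err_4:.3f}")
```

With 2 bits each group can only represent four levels, which is why the experts use a small group size here, and why the resulting scale/zero metadata is large enough to be worth offloading.
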