---
license: apache-2.0
tags:
- moe
train: false
inference: false
pipeline_tag: text-generation
---
## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ
This is a version of the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model quantized with a mix of 4-bit and 2-bit precision via Half-Quadratic Quantization (HQQ).

More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.

The difference between this model and [Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ](https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ) is that this version offloads the quantization metadata to the CPU, so it only needs 13GB of VRAM to run instead of 20GB.
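
For reference, the offloading is controlled by the `offload_meta` flag of HQQ's `BaseQuantizeConfig`, used in context in the Quantization section below; a minimal illustration:

```Python
from hqq.core.quantize import BaseQuantizeConfig

# offload_meta=True keeps the quantization metadata (scales and zero-points)
# on the CPU instead of the GPU, which is what lowers the VRAM requirement
config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
```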

### Basic Usage
To run the model, first install the HQQ library from https://github.com/mobiusml/hqq (for example, via `pip install hqq`), then use it as follows:
```Python
model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'

# Load the tokenizer and the pre-quantized model
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

# Optional: switch to the compiled PyTorch backend for faster inference
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)

# Text generation
prompt = "<s> [INST] How do I build a car? [/INST] "
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**(inputs.to('cuda')), max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
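
If you want to check the VRAM figure on your machine, here is a minimal sketch (not part of the original card) that reports the peak GPU memory PyTorch allocated while loading:

```Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-metaoffload-HQQ'

# Track GPU allocations made while the quantized model is loaded
torch.cuda.reset_peak_memory_stats()
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = HQQModelForCausalLM.from_quantized(model_id)

# With the metadata offloaded to the CPU, this should come out around 13GB
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```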

----------------------------------------------------------------------------------------------------------------------------------

### Quantization
You can reproduce the model using the following quant configs:

```Python
import torch
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

# hf_auth is your Hugging Face access token; cache_path is your local cache directory
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
model = HQQModelForCausalLM.from_pretrained(model_id, use_auth_token=hf_auth, cache_dir=cache_path)

# Quantization params: 4-bit attention layers, 2-bit experts, metadata offloaded to the CPU
from hqq.core.quantize import *
attn_params = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
experts_params = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)
attn_params['scale_quant_params']['group_size'] = 256
attn_params['zero_quant_params']['group_size'] = 256

quant_config = {}
# Attention
quant_config['self_attn.q_proj'] = attn_params
quant_config['self_attn.k_proj'] = attn_params
quant_config['self_attn.v_proj'] = attn_params
quant_config['self_attn.o_proj'] = attn_params
# Experts
quant_config['block_sparse_moe.experts.w1'] = experts_params
quant_config['block_sparse_moe.experts.w2'] = experts_params
quant_config['block_sparse_moe.experts.w3'] = experts_params

# Quantize the model in place and switch to eval mode
model.quantize_model(quant_config=quant_config, compute_dtype=torch.float16)
model.eval()
```
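
To avoid re-quantizing on every run, the quantized model can be serialized; a short sketch assuming HQQ's `save_quantized`/`from_quantized` API (the local path is illustrative):

```Python
# Save the quantized weights to a local folder ('./mixtral-hqq' is an illustrative path) ...
model.save_quantized('./mixtral-hqq')

# ... and later reload them directly, skipping the quantization step
model = HQQModelForCausalLM.from_quantized('./mixtral-hqq')
```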