Llama-2-7b-hf-4bit_g64-HQQ

This is a version of the Llama-2-7B-hf model quantized to 4-bit (group size 64) via Half-Quadratic Quantization (HQQ): https://mobiusml.github.io/hqq_blog/

To run the model, install the HQQ library and a compatible transformers version:

# This model is deprecated and requires older versions of these libraries
pip install hqq==0.1.8
pip install transformers==4.46.0

and use it as follows:

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = 'mobiuslabsgmbh/Llama-2-7b-hf-4bit_g64-HQQ'

# Load the tokenizer and the pre-quantized model weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = HQQModelForCausalLM.from_quantized(model_id)
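
Once loaded, the model can be used for text generation through the usual transformers generate/decode calls. A minimal sketch follows; the prompt and generation settings are illustrative only, and it assumes the quantized model sits on a single CUDA device (see the limitations below) and that the wrapper exposes the standard generate API:

# Illustrative generation example (prompt and settings are placeholders)
prompt  = "Explain Half-Quadratic Quantization in one sentence."
inputs  = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))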

Limitations:
- Only supports a single-GPU runtime.
- Not compatible with HuggingFace's PEFT.
