---
tags:
- gptq
- 4bit
- gptqmodel
- modelcloud
- llama-3.1
- 8b
---
This model has been quantized using [GPTQModel](https://github.com/ModelCloud/GPTQModel).

- **bits**: 4
- **group_size**: 128
- **desc_act**: true
- **static_groups**: false
- **sym**: true
- **lm_head**: false
- **damp_percent**: 0.01
- **true_sequential**: true
- **model_name_or_path**: ""
- **model_file_base_name**: "model"
- **quant_method**: "gptq"
- **checkpoint_format**: "gptq"
- **meta**:
  - **quantizer**: "gptqmodel:0.9.9-dev0"

**Here is an example:**
```python
import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# GPU that the tokenized inputs are moved to.
device = torch.device("cuda:0")

model_name = "ModelCloud/Meta-Llama-3.1-8B-gptq-4bit"

prompt = "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"

# Load the tokenizer and the 4-bit quantized checkpoint from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTQModel.from_quantized(model_name)

# Tokenize the prompt and place the tensors on the GPU.
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding (num_beams=1), generating between 1 and 512 new tokens.
res = model.generate(**inputs, num_beams=1, min_new_tokens=1, max_new_tokens=512)

print(tokenizer.decode(res[0]))
```
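
To give a rough feel for what `bits: 4`, `group_size: 128`, and `sym: true` mean for the stored weights, here is a minimal pure-Python sketch of group-wise symmetric quantization. It is an illustration only: it uses plain round-to-nearest rather than the GPTQ error-compensating procedure, it does not model `desc_act`, `damp_percent`, or `true_sequential`, and the helper names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical and not part of the GPTQModel API.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group (illustrative only)."""
    maxq = 2 ** bits - 1        # largest integer code: 15 for 4 bits
    zero = (maxq + 1) // 2      # symmetric mode: zero point fixed at the midpoint (8)
    absmax = max(abs(w) for w in weights)
    scale = (2 * absmax) / maxq if absmax > 0 else 1.0
    # Map each weight to an integer code in [0, maxq].
    codes = [min(maxq, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Recover approximate float weights from the integer codes."""
    return [scale * (c - zero) for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a row in chunks of group_size, each with its own scale."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weights, not real model data

groups = quantize_row(row, bits=4, group_size=128)
restored = [w for codes, scale, zero in groups for w in dequantize_group(codes, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(row, restored))

print(f"groups: {len(groups)}, max abs reconstruction error: {max_err:.6f}")
```

Each group of 128 weights shares one scale, so a smaller `group_size` tracks local weight ranges more closely at the cost of storing more scales.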