---
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# Llama 3.2 (1B) Instruct quantized using SparseGPT (4-bit)

Example usage:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Almheiri/Llama-3.2-1B-Instruct-SparseGPT-INT4"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's stored dtype
    device_map="auto",    # place the weights on the available device(s)
)

# Chat-formatted prompt for the instruct model
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

# Build input ids with the model's chat template and move them to the
# same device the weights were placed on (instead of hardcoding "cuda")
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)

# Decode and keep only the assistant's reply
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].split("assistant")[-1])
```
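As a quick sanity check that the 4-bit weights actually shrank the model, you can print the loaded model's memory footprint. This is a minimal sketch using the standard `get_memory_footprint()` method on `transformers` models; it assumes `model` from the snippet above is already loaded, and the exact figure will vary with your environment.

```python
# Report how much memory the quantized weights occupy
# (assumes `model` from the usage example above is loaded).
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GiB")
```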