---
language:
- en
tags:
- llama
- quantized
- 4-bit
license: llama3.1
---

# Quantized Llama 3.1 8B Instruct Model

This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.

## Quantization Details

- Method: 4-bit quantization using bitsandbytes
- Quantization type: nf4
- Compute dtype: float16
- Double quantization: True

## Performance Metrics

- Average performance: 22.766 tokens/second
- Total tokens generated: 5000
- Total time: 219.63 seconds

## Usage

This model can be loaded with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")
```
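For reference, the quantization settings listed under Quantization Details map onto a `BitsAndBytesConfig` roughly as sketched below. This is an illustrative reconstruction, not the exact command used to produce this repository, and the base-model repo ID shown is an assumption; substitute the full-precision checkpoint you want to quantize.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: mirrors the quantization details stated in this card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Method: 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",             # Quantization type: nf4
    bnb_4bit_compute_dtype=torch.float16,  # Compute dtype: float16
    bnb_4bit_use_double_quant=True,        # Double quantization: True
)

# Base model ID is an assumption for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantizing on load this way requires the `bitsandbytes` package and a CUDA-capable GPU; the pre-quantized repository above can be loaded directly without rebuilding this config.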