Quantized Llama 3.1 8B Instruct Model

This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.
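As a rough illustration of what 4-bit weights buy you, the weight memory drops to about a quarter of the fp16 footprint. The sketch below uses approximate numbers (a round 8B parameter count) and ignores double-quantization overhead and any layers kept in higher precision:

```python
# Back-of-envelope weight memory for 4-bit quantization.
# The 8e9 parameter count is an approximation for illustration only.
params = 8.0e9                   # ~8 billion parameters
fp16_gib = params * 2.0 / 2**30  # 16-bit weights: 2 bytes per parameter
nf4_gib = params * 0.5 / 2**30   # 4-bit weights: 0.5 bytes per parameter
print(f"fp16 ~ {fp16_gib:.1f} GiB, nf4 ~ {nf4_gib:.1f} GiB")
```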

Quantization Details

  • Method: 4-bit quantization using bitsandbytes
  • Quantization type: nf4
  • Compute dtype: float16
  • Double quantization: True
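The settings above map onto a bitsandbytes configuration along these lines (a sketch, relevant mainly if you want to re-quantize the base model yourself rather than load this already-quantized checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig

# Mirrors the settings listed above: nf4 4-bit weights, float16 compute,
# and nested (double) quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```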

Performance Metrics

  • Average throughput: 22.766 tokens/second
  • Total tokens generated: 5,000
  • Total time: 219.63 seconds
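The average figure is simply total tokens divided by total time:

```python
# Average throughput = total tokens / total wall-clock time.
total_tokens = 5000
total_time_s = 219.63
throughput = total_tokens / total_time_s
print(round(throughput, 3))  # tokens/second -> 22.766
```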

Usage

This model can be loaded with:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The 4-bit quantization config is stored with the checkpoint, so
# from_pretrained applies it automatically (bitsandbytes must be installed).
model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")