---
language:
- en
tags:
- llama
- quantized
- 4-bit
license: llama3.1
---

# Quantized Llama 3.1 8B Instruct Model

This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.

## Quantization Details

- Method: 4-bit quantization using bitsandbytes
- Quantization type: nf4
- Compute dtype: float16
- Double quantization: True

## Performance Metrics

- Average performance: 22.766 tokens/second
- Total tokens generated: 5000
- Total time: 219.63 seconds

## Usage

This model can be loaded with:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")
```
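For reference, the quantization settings listed under Quantization Details map onto a `BitsAndBytesConfig` roughly as sketched below. This is an illustrative reconstruction, not the exact command used to produce this repository, and the base-model repo ID shown is an assumption; substitute the full-precision checkpoint you want to quantize.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: mirrors the quantization details stated in this card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Method: 4-bit quantization via bitsandbytes
    bnb_4bit_quant_type="nf4",             # Quantization type: nf4
    bnb_4bit_compute_dtype=torch.float16,  # Compute dtype: float16
    bnb_4bit_use_double_quant=True,        # Double quantization: True
)

# Base model ID is an assumption for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantizing on load this way requires the `bitsandbytes` package and a CUDA-capable GPU; the pre-quantized repository above can be loaded directly without rebuilding this config.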