# Quantized Llama 3.1 8B Instruct Model
This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.
## Quantization Details
- Method: 4-bit quantization using bitsandbytes
- Quantization type: nf4
- Compute dtype: float16
- Double quantization: True
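As a rough sketch, these settings map onto the standard `BitsAndBytesConfig` in transformers. The base model id below is an assumption for illustration; this shows how the quantization could be reproduced, not necessarily the exact script used:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

# Base model id is an assumption, not confirmed by this card.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```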
## Performance Metrics
- Average throughput: 22.766 tokens/second
- Total tokens generated: 5,000
- Total time: 219.63 seconds
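For reference, a minimal sketch of how such a throughput number can be measured. The prompt and generation length here are illustrative assumptions, not the benchmark actually used:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "glouriousgautam/llama-3-8b-instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Illustrative prompt; the original benchmark settings are not documented here.
inputs = tokenizer("Explain quantization in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.3f} tokens/second")
```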
## Usage
This model can be loaded with:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")
```
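A short follow-up sketch of running inference with the loaded model, assuming the tokenizer ships the Llama 3.1 chat template (the prompt is illustrative):

```python
messages = [{"role": "user", "content": "Write a haiku about quantization."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```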