Quantized Llama 3.1 8B Instruct Model

This is a 4-bit quantized version of the Llama 3.1 8B Instruct model.
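As a rough illustration of what 4-bit weights buy you, the weight memory drops to about a quarter of the fp16 footprint. The sketch below uses approximate numbers (a round 8B parameter count) and ignores double-quantization overhead and any layers kept in higher precision:

```python
# Back-of-envelope weight memory for 4-bit quantization.
# The 8e9 parameter count is an approximation for illustration only.
params = 8.0e9                   # ~8 billion parameters
fp16_gib = params * 2.0 / 2**30  # 16-bit weights: 2 bytes per parameter
nf4_gib = params * 0.5 / 2**30   # 4-bit weights: 0.5 bytes per parameter
print(f"fp16 ~ {fp16_gib:.1f} GiB, nf4 ~ {nf4_gib:.1f} GiB")
```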

Quantization Details

  • Method: 4-bit quantization using bitsandbytes
  • Quantization type: nf4
  • Compute dtype: float16
  • Double quantization: True
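The settings above map onto a bitsandbytes configuration along these lines (a sketch, relevant mainly if you want to re-quantize the base model yourself rather than load this already-quantized checkpoint):

```python
import torch
from transformers import BitsAndBytesConfig

# Mirrors the settings listed above: nf4 4-bit weights, float16 compute,
# and nested (double) quantization of the quantization constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
```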

Performance Metrics

  • Average throughput: 22.766 tokens/second
  • Total tokens generated: 5,000
  • Total time: 219.63 seconds
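The average figure is simply total tokens divided by total time:

```python
# Average throughput = total tokens / total wall-clock time.
total_tokens = 5000
total_time_s = 219.63
throughput = total_tokens / total_time_s
print(round(throughput, 3))  # tokens/second -> 22.766
```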

Usage

This model can be loaded with:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# The 4-bit quantization config is stored with the checkpoint, so
# from_pretrained applies it automatically (bitsandbytes must be installed).
model = AutoModelForCausalLM.from_pretrained(
    "glouriousgautam/llama-3-8b-instruct-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("glouriousgautam/llama-3-8b-instruct-bnb-4bit")