
This model has been xMADified!

This repository contains meta-llama/Llama-3.1-8B-Instruct quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.

Why should I use this model?

  1. Accuracy: This xMADified model outperforms the most downloaded quantized version of meta-llama/Llama-3.1-8B-Instruct on six of the eight benchmarks in Table 1 below, and stays within roughly 1.2 points of the full-precision model on every benchmark.

  2. Memory-efficiency: The full-precision model is around 16 GB, while this xMADified model is only 5.7 GB, making it feasible to run on an 8 GB GPU.

  3. Fine-tuning: These models can be fine-tuned on the same reduced (5.7 GB) memory footprint in just three clicks. Watch our product demo here
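The memory figures in item 2 follow from simple arithmetic on weight storage. The sketch below is a rough estimate only: it assumes about 8.03 billion parameters (an approximate figure not stated in this card) and counts weights alone, ignoring activations and the KV cache. `weight_size_gb` is a hypothetical helper introduced for illustration.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: approximate parameter count for an 8B-class Llama model.
NUM_PARAMS = 8.03e9

fp16_gb = weight_size_gb(NUM_PARAMS, 16)  # 16-bit floats
int4_gb = weight_size_gb(NUM_PARAMS, 4)   # 4-bit integers

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB (lower bound)")
```

The 4-bit estimate is a lower bound. The 5.7 GB checkpoint is larger, presumably because some tensors and the quantization metadata are not stored in 4 bits.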

Table 1: xMAD vs. Unsloth vs. Meta

| Model | MMLU | Arc Challenge | Arc Easy | LAMBADA Standard | LAMBADA OpenAI | PIQA | Winogrande | HellaSwag |
|---|---|---|---|---|---|---|---|---|
| xmadai/Llama-3.1-8B-Instruct-xMADai-INT4 | 66.83 | 52.3 | 82.11 | 65.73 | 73.30 | 79.88 | 72.77 | 58.49 |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | 65.91 | 51.37 | 80.89 | 63.98 | 71.49 | 79.43 | 73.80 | 58.51 |
| meta-llama/Llama-3.1-8B-Instruct | 68.05 | 51.71 | 81.9 | 66.18 | 73.55 | 79.87 | 73.72 | 59.10 |
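To make the comparison in Table 1 easier to read, the scores can be summarized with a few lines of Python. This is only a sketch: the numbers are copied from the table, while the unweighted mean and the per-benchmark win count are our own illustrative summaries, not metrics reported by the benchmarks themselves.

```python
BENCHMARKS = [
    "MMLU", "Arc Challenge", "Arc Easy", "LAMBADA Standard",
    "LAMBADA OpenAI", "PIQA", "Winogrande", "HellaSwag",
]

# Scores copied from Table 1, in the same column order as BENCHMARKS.
SCORES = {
    "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4":
        [66.83, 52.30, 82.11, 65.73, 73.30, 79.88, 72.77, 58.49],
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit":
        [65.91, 51.37, 80.89, 63.98, 71.49, 79.43, 73.80, 58.51],
    "meta-llama/Llama-3.1-8B-Instruct":
        [68.05, 51.71, 81.90, 66.18, 73.55, 79.87, 73.72, 59.10],
}


def mean(values):
    """Unweighted average across benchmarks."""
    return sum(values) / len(values)


xmad = SCORES["xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"]
unsloth = SCORES["unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"]

# Benchmarks where the xMAD model scores higher than the Unsloth model.
wins = [name for name, x, u in zip(BENCHMARKS, xmad, unsloth) if x > u]

for model, scores in SCORES.items():
    print(f"{model}: mean = {mean(scores):.2f}")
print(f"xMAD ahead of Unsloth on {len(wins)} of {len(BENCHMARKS)}: {wins}")
```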

How to Run the Model

Loading the checkpoint of this xMADified model requires less than 6 GiB of VRAM, so it runs comfortably on an 8 GB GPU.
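Before loading the checkpoint, it can help to confirm that the GPU has enough free memory. The sketch below is one possible check, not part of the xMAD tooling: `fits_in_vram` is a hypothetical helper, the 6 GiB threshold is taken from the figure above, and the query relies on PyTorch's `torch.cuda.mem_get_info()` when PyTorch and a CUDA device are available.

```python
def fits_in_vram(free_bytes: int, required_gib: float = 6.0) -> bool:
    """Return True if the free memory covers the required amount in GiB."""
    return free_bytes >= required_gib * 1024**3


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # mem_get_info() returns (free, total) in bytes for the current device.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        print(f"Free VRAM: {free_bytes / 1024**3:.1f} GiB "
              f"of {total_bytes / 1024**3:.1f} GiB")
        print("Enough room for the checkpoint:", fits_in_vram(free_bytes))
    else:
        print("No CUDA device detected; skipping the VRAM check.")
```

Generation needs additional memory for activations and the KV cache, so some headroom above the checkpoint size is advisable.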

Package prerequisites: Run the following commands to install the required packages.

pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq

Sample Inference Code

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.1-8B-Instruct-xMADai-INT4"

# Chat-style prompt: a system instruction followed by the user question.
prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)

# Apply the model's chat template and move the tokenized inputs to the GPU.
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit quantized checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map='auto',
    trust_remote_code=True,
)

# Sample up to 1024 new tokens; the decoded text includes the prompt.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Here's a sample output of the model, using the code above:

["system\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant, that responds as a pirate.user\n\nWhat's Deep Learning?assistant\n\nDeep Learning be a fascinatin' field, matey! It's a form o' artificial intelligence that's based on deep neural networks, which be a type o' machine learning algorithm.\n\nYer see, traditional machine learnin' algorithms be based on shallow nets, meaning they've just one or two layers. But deep learnin' takes it to a whole new level, with multiple layers stacked on top o' each other like a chest overflowin' with booty!\n\nEach o' these layers be responsible fer processin' a different aspect o' the data, from basic features to more abstract representations. It's like navigatin' through a treasure map, with each layer helpin' ye uncover the hidden patterns and patterns hidden within the data.\n\nDeep learnin' be often used in image and speech recognition, natural language processing, and even robotics. But it be a complex and challengin' field, matey, and it requires a strong grasp o' mathematics and computer science.\n\nSo hoist the sails and set course fer the world o' deep learnin', me hearty!"]
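The sample output above starts with the system and user messages because `generate` returns the prompt tokens followed by the newly generated ones. If only the reply is wanted, one option is to drop the prompt tokens before decoding. The sketch below assumes the `inputs`, `outputs`, and `tokenizer` variables from the sample code; `strip_prompt` is a hypothetical helper introduced here for illustration.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[Sequence[int]], prompt_len: int) -> List[Sequence[int]]:
    """Drop the first prompt_len token ids from every generated sequence."""
    return [seq[prompt_len:] for seq in output_ids]


# Usage with the variables from the sample code above (not executed here):
#   prompt_len = inputs["input_ids"].shape[1]
#   new_tokens = strip_prompt(outputs, prompt_len)
#   print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))

# Toy check with plain lists standing in for token ids.
toy_outputs = [[1, 2, 3, 10, 11], [1, 2, 3, 12, 13]]
print(strip_prompt(toy_outputs, prompt_len=3))  # [[10, 11], [12, 13]]
```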

Contact Us

For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.
