This is an 8-bit GPTQ-quantized version of Meta-Llama-3.1-8B-Instruct. Quantization was done using the AutoGPTQ library.

Use with transformers

With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by using the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers, and that you have auto-gptq and optimum installed.

!pip install auto-gptq optimum --quiet
!pip install --upgrade transformers --quiet

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "iqbalamo93/Meta-Llama-3.1-8B-Instruct-GPTQ-Q_8"

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_id) 

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
# The model is already loaded and placed on a device above,
# so the pipeline does not need its own device_map.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,  # return only the newly generated reply
    "do_sample": False,         # greedy decoding; temperature is ignored in this mode
    "pad_token_id": 128001,
}

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])
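
The same chat can also be run through the Auto classes with generate() instead of the pipeline. The helper below is a minimal sketch, not part of the original card: the name generate_reply and its max_new_tokens default are illustrative assumptions, and it expects the model and tokenizer objects loaded above.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=256):
    """Return the assistant reply for a list of chat messages (greedy decoding).

    `generate_reply` is a hypothetical helper name used for illustration only.
    """
    # Render the chat with the model's chat template and tokenize it in one step.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]

    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # generate() returns prompt + completion; keep only the new tokens.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

With the objects defined earlier, print(generate_reply(model, tokenizer, messages)) should print a reply comparable to the pipeline output.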

Note: You can also find detailed recipes on how to use the model locally, with torch.compile(), assisted generation, quantization, and more at huggingface-llama-recipes

Tool use with transformers

LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting here.

Tool use is also supported through chat templates in Transformers. Here is a quick example showing a single simple tool:

# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.
    
    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
  {"role": "system", "content": "You are a bot that responds to weather queries."},
  {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)

You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:

tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})

and then call the tool and append the result, with the tool role, like so:

messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})

After that, you can generate() again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information, see the LLaMA prompt format docs and the Transformers tool use documentation.
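
The steps above (record the tool call, run the tool, append its result) can be sketched as a small dispatch helper. This is an illustration under stated assumptions, not an official API: TOOLS, parse_tool_call, and run_tool_call are hypothetical names, and the sketch assumes the model emits the tool call as a single JSON object whose arguments sit under either "arguments" or "parameters".

```python
import json


def get_current_temperature(location: str) -> float:
    """Stand-in tool; a real implementation would query a weather service."""
    return 22.0


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_current_temperature": get_current_temperature}


def parse_tool_call(generated_text):
    """Parse a JSON tool call (assumed format) into the dict used by the chat template."""
    data = json.loads(generated_text)
    # Assumption: the argument dict may appear under either key.
    args = data.get("arguments", data.get("parameters", {}))
    return {"name": data["name"], "arguments": args}


def run_tool_call(messages, tool_call, tools=TOOLS):
    """Append the assistant's tool call and the tool's result to the chat."""
    messages.append(
        {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}
    )
    result = tools[tool_call["name"]](**tool_call["arguments"])
    messages.append(
        {"role": "tool", "name": tool_call["name"], "content": str(result)}
    )
    return messages


chat = [{"role": "user", "content": "Hey, what's the temperature in Paris right now?"}]
call = parse_tool_call(
    '{"name": "get_current_temperature", "parameters": {"location": "Paris, France"}}'
)
run_tool_call(chat, call)
```

After this, chat ends with the tool message and can be passed back through apply_chat_template and generate() for the final answer.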

Use with llama

Please follow the instructions in the repository.

To download the original checkpoints, see the example command below, which uses huggingface-cli:

huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct

Llama 3.1 instruct

Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper.

Calibration data

As done by AutoGPTQ.

TODO: Study the impact of calibration data on Instruction-tuned models.

Evaluations

TODO: Evaluate the 8-bit model.
