File size: 5,172 Bytes
1a1ddb3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e479502
948bb47
2284073
a732954
 
 
e479502
1a1ddb3
cb2237b
 
1a1ddb3
 
 
 
 
bcec846
 
 
 
 
06c2e3e
1a1ddb3
 
b68f811
1a1ddb3
b68f811
1a1ddb3
8ce66a0
1a1ddb3
b68f811
 
 
 
1a1ddb3
 
 
 
 
b68f811
 
 
 
 
1a1ddb3
b68f811
 
 
 
 
 
 
 
 
 
 
 
1a1ddb3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e479502
1a1ddb3
3382e7b
09098a4
1a1ddb3
 
 
 
3382e7b
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.1
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- autoquant
- gptq
- 8 bit

widget:
  - messages:
      - role: user
        content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

This is 8-bit GPTQ  version of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
Quantization has been done using auto-gptq library.

### Use with transformers

Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.

Make sure to update your transformers installation via `pip install --upgrade transformers` and you have auto-gptq, optimum installed.

```bash
!pip install auto-gptq optimum --quiet
!pip install -q --upgrade transformers --quiet
```

```python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "iqbalamo93/Meta-Llama-3.1-8B-Instruct-GPTQ-Q_8"

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_id) 

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline(
    "text-generation",
    model=model,
    device_map="auto",
    tokenizer=tokenizer,
)

generation_args = { 
    "max_new_tokens": 500, 
    "return_full_text": False, 
    "temperature": 0.1, 
    "do_sample": False,
    "pad_token_id": 128001 
} 

output = pipe(messages, **generation_args) 
print(output[0]['generated_text'])

```

Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at [`huggingface-llama-recipes`](https://github.com/huggingface/huggingface-llama-recipes)

### Tool use with transformers

LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).

Tool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers. 
Here is a quick example showing a single simple tool:

```python
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.
    
    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
  {"role": "system", "content": "You are a bot that responds to weather queries."},
  {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)
```

You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```

After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information,
see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).


### Use with `llama`

Please, follow the instructions in the [repository](https://github.com/meta-llama/llama)

To download Original checkpoints, see the example command below leveraging `huggingface-cli`:

```
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
```



#### Llama 3.1 instruct 

Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. For more details on the safety mitigations implemented please read the Llama 3 paper. 

**Calibration data**

As done by the Auto-gptq library.
TODO: Study the impact of calibration data on Instruction tuned models.


### Evaluations

TODO wrt 8-bit model