Eldar Kurtic committed on
Commit 9064e71 · 1 Parent(s): 8f9fd99

add readme

Files changed (1)
  1. README.md +258 -0
README.md ADDED
@@ -0,0 +1,258 @@
+ ---
+ tags:
+ - fp8
+ - vllm
+ language:
+ - en
+ - de
+ - fr
+ - it
+ - pt
+ - hi
+ - es
+ - th
+ pipeline_tag: text-generation
+ license: llama3.2
+ base_model: meta-llama/Llama-3.2-3B-Instruct
+ ---
+
+ # Llama-3.2-3B-Instruct-FP8-dynamic
+
+ ## Model Overview
+ - **Model Architecture:** Meta-Llama-3.2
+   - **Input:** Text
+   - **Output:** Text
+ - **Model Optimizations:**
+   - **Weight quantization:** FP8
+   - **Activation quantization:** FP8
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), this model is intended for assistant-like chat.
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
+ - **Release Date:** 9/25/2024
+ - **Version:** 1.0
+ - **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE)
+ - **Model Developers:** Neural Magic
+
+ Quantized version of [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
+ It achieves an average score of 69.33 on a subset of tasks from the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 70.09.
+
+ ### Model Optimizations
+
+ This model was obtained by quantizing the weights and activations of [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
+ This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.
+
+ Only the weights and activations of the linear operators within transformer blocks are quantized. Weights use symmetric per-channel quantization, in which one linear scale per output dimension maps between the FP8 and original representations. Activations are quantized symmetrically on a dynamic per-token basis.
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
+
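The quantization scheme described above can be illustrated with a small pure-Python simulation. This is a minimal sketch, not the LLM Compressor or vLLM implementation: it assumes the FP8 format is E4M3 (largest finite value 448), and the helper names `round_to_e4m3`, `quantize_rowwise`, and `fp8_linear` are hypothetical. Weight scales are computed once per output channel, while activation scales are recomputed per token for each input.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: FP8 here means E4M3, whose largest finite value is 448


def round_to_e4m3(x):
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(mag)   # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)     # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return math.copysign(round(mag / step) * step, x)


def quantize_rowwise(matrix):
    """Symmetric quantization with one scale per row; returns (fp8_rows, scales)."""
    rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        rows.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return rows, scales


def fp8_linear(x, w_q, w_scales):
    """Compute x @ W.T from quantized operands, rescaling each output element."""
    x_q, x_scales = quantize_rowwise(x)  # dynamic: scales depend on the current input
    return [
        [sum(a * b for a, b in zip(xr, wr)) * xs * ws for wr, ws in zip(w_q, w_scales)]
        for xr, xs in zip(x_q, x_scales)
    ]


# Weights: one row per output channel, quantized once ahead of time (static).
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w_q, w_scales = quantize_rowwise(weight)

# Activations: one row per token, quantized at inference time (dynamic).
tokens = [[1.0, 2.0, 3.0], [-0.5, 0.25, 0.0]]
y_fp8 = fp8_linear(tokens, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(t, w)) for w in weight] for t in tokens]
print(y_fp8)
print(y_ref)
```

Comparing `y_fp8` with `y_ref` gives a feel for the rounding error that the per-row scales keep bounded; real kernels perform the same rescaling on the GPU rather than in Python lists.
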
+ ## Deployment
+
+ ### Use with vLLM
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic"
+
+ sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ messages = [
+     {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ # Render the chat as a prompt string; add_generation_prompt=True appends the
+ # assistant header so the model starts a reply.
+ prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+
+ llm = LLM(model=model_id)
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
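As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat completion request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic`) and is listening at `http://localhost:8000/v1`; that address and the helper names `build_chat_request` and `chat` are assumptions made for this example, not part of the original instructions.

```python
import json
import urllib.request

MODEL_ID = "neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic"


def build_chat_request(messages, base_url="http://localhost:8000/v1", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "temperature": 0.6,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Who are you?"}])
print(request.full_url)
```

Calling `chat([...])` only works while the server is running; `build_chat_request` can be inspected offline to check the payload.
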
+ ## Creation
+
+ This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code snippet below.
+
+ ```python
+ import torch
+
+ from transformers import AutoTokenizer
+
+ from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
+ from llmcompressor.transformers.compression.helpers import (  # noqa
+     calculate_offload_device_map,
+     custom_offload_device_map,
+ )
+
+ # FP8 recipe: static per-channel weight scales, dynamic per-token activation scales.
+ recipe = """
+ quant_stage:
+     quant_modifiers:
+         QuantizationModifier:
+             ignore: ["lm_head"]
+             config_groups:
+                 group_0:
+                     weights:
+                         num_bits: 8
+                         type: float
+                         strategy: channel
+                         dynamic: false
+                         symmetric: true
+                     input_activations:
+                         num_bits: 8
+                         type: float
+                         strategy: token
+                         dynamic: true
+                         symmetric: true
+                     targets: ["Linear"]
+ """
+
+ model_stub = "meta-llama/Llama-3.2-3B-Instruct"
+ model_name = model_stub.split("/")[-1]
+
+ # Compute a device map targeting a single GPU.
+ device_map = calculate_offload_device_map(
+     model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype="auto"
+ )
+
+ model = SparseAutoModelForCausalLM.from_pretrained(
+     model_stub, torch_dtype="auto", device_map=device_map
+ )
+
+ output_dir = f"./{model_name}-FP8-dynamic"
+
+ # Apply the recipe in one shot and save the compressed checkpoint.
+ oneshot(
+     model=model,
+     recipe=recipe,
+     output_dir=output_dir,
+     save_compressed=True,
+     tokenizer=AutoTokenizer.from_pretrained(model_stub),
+ )
+ ```
+
+ ## Evaluation
+
+ The model was evaluated on MMLU, ARC-Challenge, GSM-8K, and Winogrande.
+ Evaluation was conducted using the Neural Magic fork of [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness/tree/llama_3.1_instruct) (branch llama_3.1_instruct) and the [vLLM](https://docs.vllm.ai/en/stable/) engine.
+ This version of the lm-evaluation-harness includes versions of ARC-Challenge, GSM-8K, MMLU, and MMLU-cot that match the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals).
+
+ ### Accuracy
+
+ #### Open LLM Leaderboard evaluation scores
+ <table>
+   <tr>
+     <td><strong>Benchmark</strong></td>
+     <td><strong>Llama-3.2-3B-Instruct</strong></td>
+     <td><strong>Llama-3.2-3B-Instruct-FP8-dynamic (this model)</strong></td>
+     <td><strong>Recovery</strong></td>
+   </tr>
+   <tr>
+     <td>MMLU-cot (0-shot)</td>
+     <td>55.22</td>
+     <td>55.28</td>
+     <td>100.1%</td>
+   </tr>
+   <tr>
+     <td>ARC Challenge (0-shot)</td>
+     <td>77.39</td>
+     <td>76.62</td>
+     <td>99.0%</td>
+   </tr>
+   <tr>
+     <td>GSM-8K-cot (8-shot, strict-match)</td>
+     <td>77.56</td>
+     <td>76.12</td>
+     <td>98.1%</td>
+   </tr>
+   <tr>
+     <td>Winogrande (5-shot)</td>
+     <td>70.2</td>
+     <td>69.3</td>
+     <td>98.7%</td>
+   </tr>
+   <tr>
+     <td><strong>Average</strong></td>
+     <td><strong>70.09</strong></td>
+     <td><strong>69.33</strong></td>
+     <td><strong>98.92%</strong></td>
+   </tr>
+ </table>
+
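The Recovery column can be rechecked from the scores in the table. The sketch below assumes recovery is defined as the quantized score divided by the baseline score, expressed as a percentage; that definition is inferred from the numbers and is not stated in the README. The function name `recovery` is chosen for this example.

```python
# Scores copied from the table above: (baseline, FP8-dynamic).
scores = {
    "MMLU-cot (0-shot)": (55.22, 55.28),
    "ARC Challenge (0-shot)": (77.39, 76.62),
    "GSM-8K-cot (8-shot, strict-match)": (77.56, 76.12),
    "Winogrande (5-shot)": (70.2, 69.3),
}


def recovery(baseline, quantized):
    """Assumed definition: quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


per_task = {task: round(recovery(b, q), 1) for task, (b, q) in scores.items()}
avg_baseline = sum(b for b, _ in scores.values()) / len(scores)
avg_quantized = sum(q for _, q in scores.values()) / len(scores)
avg_recovery = recovery(avg_baseline, avg_quantized)

for task, value in per_task.items():
    print(f"{task}: {value}%")
print(f"Average: {avg_baseline:.2f} -> {avg_quantized:.2f}")
```

Under this definition the average recovery is computed from the two averaged scores rather than by averaging the per-task percentages, which appears to be how the table's last row was obtained.
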
+ ### Reproduction
+
+ The results were obtained using the following commands:
+
+ #### MMLU-cot
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks mmlu_cot_0shot_llama_3.1_instruct \
+   --apply_chat_template \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+
+ #### ARC-Challenge
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks arc_challenge_llama_3.1_instruct \
+   --apply_chat_template \
+   --num_fewshot 0 \
+   --batch_size auto
+ ```
+
+ #### GSM-8K
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks gsm8k_cot_llama_3.1_instruct \
+   --apply_chat_template \
+   --fewshot_as_multiturn \
+   --num_fewshot 8 \
+   --batch_size auto
+ ```
+
+ #### Winogrande
+ ```
+ lm_eval \
+   --model vllm \
+   --model_args pretrained="neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1 \
+   --tasks winogrande \
+   --num_fewshot 5 \
+   --batch_size auto
+ ```
+
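The four commands above differ only in the task name, the few-shot count, and the chat-template flags. As a convenience, the sketch below assembles the same argument lists in Python so they could be run in a loop. The `TASKS` table and the `build_command` helper are hypothetical names introduced for this example; the flags themselves are copied from the commands above.

```python
MODEL = "neuralmagic/Llama-3.2-3B-Instruct-FP8-dynamic"
MODEL_ARGS = (
    f"pretrained={MODEL},dtype=auto,add_bos_token=False,"
    "max_model_len=4096,tensor_parallel_size=1"
)

# task name -> (num_fewshot, apply_chat_template, fewshot_as_multiturn)
TASKS = {
    "mmlu_cot_0shot_llama_3.1_instruct": (0, True, False),
    "arc_challenge_llama_3.1_instruct": (0, True, False),
    "gsm8k_cot_llama_3.1_instruct": (8, True, True),
    "winogrande": (5, False, False),
}


def build_command(task):
    """Return the lm_eval argument list for one of the tasks listed above."""
    num_fewshot, chat_template, multiturn = TASKS[task]
    command = ["lm_eval", "--model", "vllm", "--model_args", MODEL_ARGS, "--tasks", task]
    if chat_template:
        command.append("--apply_chat_template")
    if multiturn:
        command.append("--fewshot_as_multiturn")
    command += ["--num_fewshot", str(num_fewshot), "--batch_size", "auto"]
    return command


for task in TASKS:
    print(" ".join(build_command(task)))
```

Each list can be passed to `subprocess.run(build_command(task), check=True)` on a machine where the evaluation harness fork and vLLM are installed.
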