Update README.md
README.md
CHANGED
@@ -184,21 +184,6 @@ huggingface-cli login
 and use a token with write access, from https://huggingface.co/settings/tokens

 # Model Quality
-We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
-
-Need to install lm-eval from source:
-https://github.com/EleutherAI/lm-evaluation-harness#install
-
-## baseline
-```Shell
-lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
-```
-
-## int4 weight only quantization with hqq (int4wo-hqq)
-```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
-```
-
 | Benchmark | | |
 |----------------------------------|----------------|---------------------------|
 | | Phi-4-mini-ins | Phi-4-mini-ins-int4wo-hqq |
@@ -221,7 +206,23 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
 | mathqa (0-shot) | 42.31 | 42.75 |
 | **Overall** | **55.35** | **53.28** |

-
+<details>
+<summary> Reproduce Model Quality Results </summary>
+We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
+
+Need to install lm-eval from source:
+https://github.com/EleutherAI/lm-evaluation-harness#install
+
+## baseline
+```Shell
+lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+```
+
+## int4 weight only quantization with hqq (int4wo-hqq)
+```Shell
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
+```
+</details>
 # Peak Memory Usage

 ## Results
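The commands in the hunk above run one task at a time. As a side note (not part of the README), lm_eval also accepts a comma-separated task list, so the rows of the quality table can be collected in a single run per model; an illustrative invocation:

```Shell
# Illustrative sketch, not from the README: --tasks takes a comma-separated list,
# so e.g. hellaswag and mathqa (both shown in the table) can be evaluated in one run.
lm_eval --model hf \
  --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq \
  --tasks hellaswag,mathqa \
  --device cuda:0 --batch_size 8
```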
@@ -232,8 +233,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
 | Peak Memory (GB) | 8.91 | 2.98 (67% reduction) |


-
-
+<details>
+<summary> Reproduce Peak Memory Usage Results </summary>
 We can use the following code to get a sense of peak memory usage during inference:

 ```Py
@@ -275,6 +276,7 @@ print("Response:", output_text[0][len(prompt):])
 mem = torch.cuda.max_memory_reserved() / 1e9
 print(f"Peak Memory Usage: {mem:.02f} GB")
 ```
+</details>

 # Model Performance

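The diff only shows the tail of the README's peak-memory snippet. For orientation, a minimal self-contained sketch of the same kind of measurement (the prompt and generation length are illustrative assumptions, not the README's exact script):

```Py
# Minimal sketch with assumed details; loading the int4wo-hqq checkpoint requires
# torchao to be installed so transformers can deserialize the quantized weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"  # or microsoft/Phi-4-mini-instruct for the baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")

prompt = "What are the benefits of quantizing a language model?"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)  # illustrative generation length
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Peak reserved memory since process start (covers weight loading plus generation),
# the same metric the README reports.
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```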
@@ -289,7 +291,8 @@ Our int4wo is only optimized for batch size 1, so expect some slowdown with larg

 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token length, please stay tuned for models optimized for larger batch sizes or longer token length.
-
+<details>
+<summary> Reproduce Model Performance Results </summary>
 ## Setup

 Get vllm source code:
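The serving benchmark command appears in the next hunk; for the latency number mentioned above, a hedged sketch of how vllm's benchmarks/benchmark_latency.py is typically invoked (the flag names and lengths are assumptions and may differ between vllm versions):

```Shell
# Illustrative sketch, not from the README: batch size 1 and short lengths match the
# regime int4wo is optimized for; check benchmarks/benchmark_latency.py --help in your vllm checkout.
python benchmarks/benchmark_latency.py \
  --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
  --input-len 256 --output-len 256 --batch-size 1
```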
@@ -353,7 +356,7 @@ Client:
 ```Shell
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
 ```
-
+</details>

 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.