jerryzh168 committed (verified)
Commit e332c15 · 1 Parent(s): 6cedb9c

Update README.md

Files changed (1)
  1. README.md +22 -19
README.md CHANGED
@@ -184,21 +184,6 @@ huggingface-cli login
  and use a token with write access, from https://huggingface.co/settings/tokens
 
  # Model Quality
- We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
-
- Need to install lm-eval from source:
- https://github.com/EleutherAI/lm-evaluation-harness#install
-
- ## baseline
- ```Shell
- lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
- ```
-
- ## int4 weight only quantization with hqq (int4wo-hqq)
- ```Shell
- lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
- ```
-
  | Benchmark | | |
  |----------------------------------|----------------|---------------------------|
  | | Phi-4-mini-ins | Phi-4-mini-ins-int4wo-hqq |
@@ -221,7 +206,23 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
  | mathqa (0-shot) | 42.31 | 42.75 |
  | **Overall** | **55.35** | **53.28** |
 
+ <details>
+ <summary> Reproduce Model Quality Results </summary>
+ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
+
+ Need to install lm-eval from source:
+ https://github.com/EleutherAI/lm-evaluation-harness#install
+
+ ## baseline
+ ```Shell
+ lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+
+ ## int4 weight only quantization with hqq (int4wo-hqq)
+ ```Shell
+ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+ </details>
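
As an aside on the reproduction block above: lm-evaluation-harness also exposes a Python entry point, so the same hellaswag comparison can be scripted rather than run through the CLI. A minimal sketch, assuming lm-eval is installed from source and that `lm_eval.simple_evaluate` is available; the argument values simply mirror the CLI flags above:

```Py
# Hedged sketch: script the baseline vs. int4wo-hqq hellaswag eval via lm-eval's Python API.
# Assumes lm-eval is installed from source and a CUDA device is available.
import lm_eval

for name in ["microsoft/Phi-4-mini-instruct", "pytorch/Phi-4-mini-instruct-int4wo-hqq"]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={name}",
        tasks=["hellaswag"],
        device="cuda:0",
        batch_size=8,
    )
    # results["results"]["hellaswag"] holds the accuracy metrics summarized in the table above.
    print(name, results["results"]["hellaswag"])
```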
  # Peak Memory Usage
 
  ## Results
@@ -232,8 +233,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
  | Peak Memory (GB) | 8.91 | 2.98 (67% reduction) |
 
 
- ## Code Example
-
+ <details>
+ <summary> Reproduce Peak Memory Usage Results </summary>
  We can use the following code to get a sense of peak memory usage during inference:
 
  ```Py
@@ -275,6 +276,7 @@ print("Response:", output_text[0][len(prompt):])
  mem = torch.cuda.max_memory_reserved() / 1e9
  print(f"Peak Memory Usage: {mem:.02f} GB")
  ```
+ </details>
 
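
For context on the truncated snippet above: the measurement pattern is to load the model, reset CUDA's peak-memory counters, run a generation, and then read back `torch.cuda.max_memory_reserved()`. A minimal sketch of that pattern (not the README's full code; the prompt, generation settings, and loading details are assumptions):

```Py
# Hedged sketch of the peak-memory measurement pattern; loading details are assumptions.
# Loading the int4wo-hqq checkpoint additionally assumes torchao is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"  # or microsoft/Phi-4-mini-instruct for the baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")

torch.cuda.reset_peak_memory_stats()  # start the peak counter from a clean slate

prompt = "What are the benefits of int4 weight-only quantization?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9  # same readout as the snippet above
print(f"Peak Memory Usage: {mem:.02f} GB")
```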
  # Model Performance
 
@@ -289,7 +291,8 @@ Our int4wo is only optimized for batch size 1, so expect some slowdown with larg
 
  Note that the latency result (benchmark_latency) is in seconds, and the serving result (benchmark_serving) is in requests per second.
  Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
-
+ <details>
+ <summary> Reproduce Model Performance Results </summary>
  ## Setup
 
  Get vllm source code:
@@ -353,7 +356,7 @@ Client:
  ```Shell
  python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
  ```
-
+ </details>
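
Before running the serving benchmark, a quick way to sanity-check that the int4 checkpoint loads and generates under vLLM is the offline `LLM` API. A rough sketch, assuming the checkpoint is directly loadable by vLLM (as the serving command above implies) and that vLLM is installed per the Setup steps:

```Py
# Rough sanity check that the int4wo-hqq checkpoint generates under vLLM's offline API.
# Assumes vLLM is installed from source as described in the Setup section.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What are the benefits of int4 weight-only quantization?"], params)
print(outputs[0].outputs[0].text)
```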
 
  # Disclaimer
  PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
 