jerryzh168 committed on
Commit 2ab02aa · verified · 1 Parent(s): 9af0683

Update README.md

Files changed (1):
1. README.md (+34 -18)
README.md CHANGED
@@ -19,20 +19,15 @@ pipeline_tag: text-generation
 
 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, by the PyTorch team.
 
-# Installation
+# Quantization Recipe
+
+First, install the required packages:
 ```
 pip install git+https://github.com/huggingface/transformers
 pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
 ```
 
-Also need to install lm-eval from source:
-https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
-# Quantization Recipe
 We used the following code to get the quantized model:
-
 ```
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
@@ -99,9 +94,19 @@ torchao.quantization.utils.recommended_inductor_config_setter()
 quantized_model = torch.compile(quantized_model, mode="max-autotune")
 print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
 ```
+
+# Serving with vllm
+We can use the same command we used in the serving benchmarks to serve the model with vllm:
+```
+vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
+```
+
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
 
+You need to install lm-eval from source:
+https://github.com/EleutherAI/lm-evaluation-harness#install
+
 ## baseline
 ```
 lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
 
@@ -134,22 +139,39 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq
 | mathqa (0-shot) | | 42.75 |
 | **Overall** | **TODO** | **TODO** |
 
+
 # Model Performance
 
-Our int4wo is only optimized for batch size 1, so we'll only benchmark the batch size 1 performance with vllm.
+Our int4wo is only optimized for batch size 1, so we'll see slowdowns at larger batch sizes. We expect this to be used in local server deployments for a single user or a few users,
+where decode tokens per second matters more than time to first token.
+
+You need to install the vllm nightly to pick up some recent changes:
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
+
 
 ## Results (A100 machine)
-| Benchmark | | |
+| Benchmark (Latency) | | |
 |----------------------------------|----------------|--------------------------|
 | | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
-| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
-| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
+| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
+| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
 | serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
 | serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|
 
 Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
 
+
+| Benchmark (Memory) | | |
+|----------------------------------|----------------|--------------------------|
+| | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
+| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
+| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
+| serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
+| serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|
+
 ## Download dataset
 Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
 
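The performance numbers in the table above come from vLLM's benchmark scripts. As an illustration only (this is not the benchmark the README uses), the sketch below shows one way to spot-check batch-size-1 decode throughput and peak CUDA memory with plain transformers `generate()`; it assumes the transformers and torchao installs from the recipe earlier in this diff, and the prompt and token counts are arbitrary.

```
# Illustrative sketch, not the vllm benchmark used for the table above.
# Assumes transformers (from source) and the torchao nightly are installed
# as in the installation step earlier in this README.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")
model = AutoModelForCausalLM.from_pretrained(
    quantized_id, device_map="cuda", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to(model.device)

# Warm up once so one-time setup does not count toward the timing.
model.generate(**inputs, max_new_tokens=16)
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()

# Timed batch-size-1 decode run.
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"decode: {new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
print(f"peak CUDA memory during generate: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```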
 
@@ -195,10 +217,4 @@ vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct
 Client:
 ```
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
-```
-
-# Serving with vllm
-We can use the same command we used in serving benchmarks to serve the model with vllm
-```
-vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
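The `vllm serve` command shown above exposes vLLM's OpenAI-compatible HTTP API. As a hedged sketch of how a client could query it (the host and port below are vLLM defaults and are not stated in the README; `requests` is an extra dependency):

```
# Illustrative client for the OpenAI-compatible server started by `vllm serve`.
# Host/port are vLLM defaults (http://localhost:8000); adjust if you changed them.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        # Must match the model name passed to `vllm serve`.
        "model": "pytorch/Phi-4-mini-instruct-int4wo-hqq",
        "messages": [
            {"role": "user", "content": "Give a one-sentence summary of int4 weight-only quantization."}
        ],
        "max_tokens": 128,
        "temperature": 0.0,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```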