Update README.md

[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight-only quantization, by the PyTorch team.

# Quantization Recipe

First, install the required packages:
```
pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
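
As a quick, optional check that the source-built transformers and the nightly torchao were actually picked up, the versions can be printed from Python (nothing here is specific to this model):
```
# Sanity check: confirm the freshly installed packages import and report their versions.
import torch
import torchao
import transformers

print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```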

We used the following code to get the quantized model:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# ... (middle of the recipe collapsed in this diff view) ...

quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
```
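
The collapsed lines above hold the quantization config, the save step, and the `benchmark_fn`/`inputs` helpers. As a rough sketch only, not the exact recipe: the snippet below assumes torchao's `int4_weight_only(group_size=128, use_hqq=True)` and a recent transformers build whose `TorchAoConfig` accepts a torchao config object; the group size and save path are illustrative.
```
# Hedged sketch of an int4 weight-only (HQQ) quantized load and save.
# Assumptions: torchao nightly exposes int4_weight_only(..., use_hqq=True)
# and this transformers build lets TorchAoConfig wrap that config directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import int4_weight_only

model_id = "microsoft/Phi-4-mini-instruct"
save_to = "Phi-4-mini-instruct-int4wo-hqq"  # illustrative output path

quantization_config = TorchAoConfig(quant_type=int4_weight_only(group_size=128, use_hqq=True))
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# torchao tensor subclasses are not supported by safetensors yet,
# so the checkpoint is saved with safe_serialization=False.
quantized_model.save_pretrained(save_to, safe_serialization=False)
tokenizer.save_pretrained(save_to)
```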

# Serving with vllm
We can use the same command used in the serving benchmarks below to serve the model with vllm:
```
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
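
Once the server is up it exposes vLLM's OpenAI-compatible API, by default on port 8000. A minimal client sketch, assuming the default host/port and that the `openai` package is installed:
```
# Query the OpenAI-compatible endpoint started by `vllm serve` above.
# Assumes the default address http://localhost:8000 and `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-int4wo-hqq",
    messages=[{"role": "user", "content": "Explain int4 weight-only quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```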

# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

lm-eval needs to be installed from source: https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
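
The same run can also be driven from Python through lm-eval's `simple_evaluate` entry point; this sketch mirrors the CLI call above (task, device, and batch size) and leaves everything else at defaults:
```
# Python counterpart of the baseline CLI evaluation above.
# Assumes lm-eval is installed from source as described.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-4-mini-instruct",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```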

...

| mathqa (0-shot) | | 42.75 |
| **Overall** | **TODO** | **TODO** |

# Model Performance

Our int4wo is only optimized for batch size 1, so we will see a slowdown at larger batch sizes. We expect it to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.

vllm nightly needs to be installed to pick up some recent changes:
```
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

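For a quick local look at batch-size-1 generation speed, vLLM's offline Python API can be used directly. This is only a sketch, not the benchmark_latency/benchmark_serving scripts used for the tables below, so the numbers will not match exactly, and the tokens-per-second figure includes prefill for this short prompt:
```
# Rough offline generation timing with vLLM's Python API (illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", tokenizer="microsoft/Phi-4-mini-instruct")
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(["Write a short story about quantization."], params)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{elapsed:.2f}s total, ~{generated / elapsed:.1f} output tokens/s (batch_size=1)")
```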
## Results (A100 machine)

| Benchmark (Latency)              | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------------|----------------|---------------------------|
| latency (batch_size=1)           | 2.46s          | 2.2s (12% speedup)        |
| latency (batch_size=128)         | 6.55s          | 17s (60% slowdown)        |
| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (20% speedup)  |
| serving (num_prompts=1000)       | 24.15 req/s    | 5.64 req/s (77% slowdown) |

Note: the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

| Benchmark (Memory)               | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------------|----------------|---------------------------|
| **TODO**                         | **TODO**       | **TODO**                  |

## Download dataset
Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
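
The downloaded file is the raw ShareGPT dump that benchmark_serving samples prompts from. A quick inspection sketch, assuming the usual ShareGPT layout of a JSON list with a `conversations` field per entry:
```
# Peek at the ShareGPT dataset used by the serving benchmark.
# Assumes each entry is a dict with a "conversations" list; adjust if the layout differs.
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print("num entries:", len(data))
print("first entry keys:", list(data[0].keys()))
print("first turn:", data[0]["conversations"][0])
```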

...

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
```