File size: 13,218 Bytes
19c2556 8ecebaf 2570290 8ecebaf 2570290 e014bbb 2570290 19c2556 b6de200 8d76f91 63d511c a2288d0 63d511c b20cbe9 63d511c b108c3f 63d511c e3fc1c2 63d511c e3fc1c2 63d511c e3fc1c2 63d511c 3be2c47 def09db b108c3f def09db 63d511c b108c3f 63d511c b108c3f 63d511c 2ab02aa b548de8 b108c3f a6100fd b9c2cf5 ea29a3b b9c2cf5 b548de8 40b794c 8d76f91 2ecb413 8d76f91 f6dfa8c 90e6cd4 8d76f91 8531f23 8d76f91 90e6cd4 8d76f91 a9f7231 8d76f91 a9f7231 4989b4d 2ab02aa 5b90bcb 8d76f91 e38437b 2ab02aa 3737ba9 b108c3f 8d76f91 e38437b 8d76f91 8e3e54a b108c3f 3bce6ad 8d76f91 37c4bff 8d76f91 2ab02aa 070762d 8d76f91 070762d 7ecf1b2 070762d b108c3f b548de8 2ab02aa b108c3f 070762d 864b9be 070762d 2ab02aa 070762d b548de8 4e1c7a0 2ab02aa 4e1c7a0 2ab02aa 4e1c7a0 8d76f91 b108c3f fe70e1e 1170503 b108c3f 1170503 fe70e1e d906e2b b108c3f 52e01da 3737ba9 b108c3f 8d76f91 3737ba9 b108c3f 37987cd 8d76f91 3737ba9 8d76f91 b108c3f 8d76f91 c606613 38d28bc e3d011c 52e01da d0f0e1e 3737ba9 8d76f91 b108c3f 8d76f91 b108c3f 8d76f91 3737ba9 8d76f91 b108c3f 4ba6889 8d76f91 b108c3f 4a40309 9579235 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 |
---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---
[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 12-20% speedup on A100 GPUs.
# Inference with vLLM
Install vllm nightly and torchao nightly to get some recent changes:
```
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/pytorch/ao.git
```
## Code Example
```Py
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
if __name__ == '__main__':
# Create an LLM.
llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")
# Generate texts from the prompts.
# The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
print("\nGenerated Outputs:\n" + "-" * 60)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}")
print(f"Output: {generated_text!r}")
print("-" * 60)
```
Note: please use `VLLM_DISABLE_COMPILE_CACHE=1` to disable compile cache when running this code, e.g. `VLLM_DISABLE_COMPILE_CACHE=1 python example.py`, since there are some issues with the composability of compile in vLLM and torchao,
this is expected be resolved in pytorch 2.8.
## Serving
Then we can serve with the following command:
```Shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
# Inference with Transformers
Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate
```
Example:
```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
torch.random.manual_seed(0)
model_path = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
{"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
{"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
)
generation_args = {
"max_new_tokens": 500,
"return_full_text": False,
"temperature": 0.0,
"do_sample": False,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
```
# Quantization Recipe
Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate
```
Use the following code to get the quantized model:
```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "microsoft/Phi-4-mini-instruct"
from torchao.quantization import Int4WeightOnlyConfig
quant_config = Int4WeightOnlyConfig(group_size=128, use_hqq=True)
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-int4wo-hqq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{
"role": "system",
"content": "",
},
{"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
templated_prompt,
return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
```
Note: to `push_to_hub` you need to run
```Shell
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
and use a token with write access, from https://huggingface.co/settings/tokens
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
Need to install lm-eval from source:
https://github.com/EleutherAI/lm-evaluation-harness#install
## baseline
```Shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
## int4 weight only quantization with hqq (int4wo-hqq)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
```
| Benchmark | | |
|----------------------------------|----------------|---------------------------|
| | Phi-4-mini-ins | Phi-4-mini-ins-int4wo-hqq |
| **Popular aggregated benchmark** | | |
| mmlu (0-shot) | 66.73 | 63.56 |
| mmlu_pro (5-shot) | 46.43 | 36.74 |
| **Reasoning** | | |
| arc_challenge (0-shot) | 56.91 | 54.86 |
| gpqa_main_zeroshot | 30.13 | 30.58 |
| HellaSwag | 54.57 | 53.54 |
| openbookqa | 33.00 | 34.40 |
| piqa (0-shot) | 77.64 | 76.33 |
| social_iqa | 49.59 | 47.90 |
| truthfulqa_mc2 (0-shot) | 48.39 | 46.44 |
| winogrande (0-shot) | 71.11 | 71.51 |
| **Multilingual** | | |
| mgsm_en_cot_en | 60.8 | 59.6 |
| **Math** | | |
| gsm8k (5-shot) | 81.88 | 74.37 |
| mathqa (0-shot) | 42.31 | 42.75 |
| **Overall** | **55.35** | **53.28** |
# Peak Memory Usage
## Results
| Benchmark | | |
|------------------|----------------|--------------------------------|
| | Phi-4 mini-Ins | Phi-4-mini-instruct-int4wo-hqq |
| Peak Memory (GB) | 8.91 | 2.98 (67% reduction) |
## Code Example
We can use the following code to get a sense of peak memory usage during inference:
```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-int4wo-hqq"
model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
torch.cuda.reset_peak_memory_stats()
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{
"role": "system",
"content": "",
},
{"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
templated_prompt,
return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```
# Model Performance
Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes, we expect this to be used in local server deployment for single or a few users where the decode tokens per second will matters more than the time to first token.
## Results (A100 machine)
| Benchmark (Latency) | | |
|----------------------------------|----------------|--------------------------|
| | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
| serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
Int4 weight only is optimized for batch size 1 and short input and output token length, please stay tuned for models optimized for larger batch sizes or longer token length.
## Setup
Get vllm source code:
```Shell
git clone [email protected]:vllm-project/vllm.git
```
Install vllm
```
VLLM_USE_PRECOMPILED=1 pip install --editable .
```
Run the benchmarks under `vllm` root folder:
## benchmark_latency
### baseline
```Shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```
### int4wo-hqq
```Shell
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1
```
## benchmark_serving
We benchmarked the throughput in a serving environment.
Download sharegpt dataset:
```Shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
### baseline
Server:
```Shell
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```
Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```
### int4wo-hqq
Server:
```Shell
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
```
Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
```
# Disclaimer
PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein. |