---
library_name: transformers
tags:
- torchao
license: mit
---
[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight-only quantization, by the PyTorch team.
# Installation
```
pip install transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
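After installing, a quick sanity check can confirm that the nightly builds are picked up (a minimal sketch; it only prints the installed versions and CUDA availability):
```
import torch
import torchao
import transformers
import vllm

# Print installed versions and confirm a CUDA device is visible
print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```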
# Quantization Recipe
We used the following code to get the quantized model:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "microsoft/Phi-4-mini-instruct"
from torchao.quantization import Int4WeightOnlyConfig
quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-int4wo"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# Local Benchmark
import torch.utils.benchmark as benchmark
import torchao

def benchmark_fn(f, *args, **kwargs):
    # Manual warmup
    for _ in range(2):
        f(*args, **kwargs)

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

torchao.quantization.utils.recommended_inductor_config_setter()
quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
```
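To reuse the pushed checkpoint later (for example on another machine), it can be loaded straight from the Hub. A minimal sketch, assuming the hypothetical repository id `YOUR_USER_ID/Phi-4-mini-instruct-int4wo` produced by the push above; the quantization config is stored with the checkpoint, so no `TorchAoConfig` is needed at load time:
```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; replace with the one produced by the push above
model_name = "YOUR_USER_ID/Phi-4-mini-instruct-int4wo"

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```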
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
## Installing the latest version from GitHub to get the most recent updates
```
pip install git+https://github.com/EleutherAI/lm-evaluation-harness
```
## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
## int4wo-hqq
```
lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
```
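The same comparison can also be scripted through lm-eval's Python API rather than the CLI; a minimal sketch whose `simple_evaluate` arguments mirror the flags above (the checkpoint ids are the ones used in this card):
```
import lm_eval

# Evaluate the baseline and the quantized checkpoint on hellaswag,
# mirroring the CLI flags above
for checkpoint in ["microsoft/Phi-4-mini-instruct", "jerryzh168/phi4-mini-int4wo-hqq"]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={checkpoint}",
        tasks=["hellaswag"],
        device="cuda:0",
        batch_size=8,
    )
    print(checkpoint, results["results"]["hellaswag"])
```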
`TODO: more complete eval results`
| Benchmark                        | Phi-4 mini-Ins | phi4-mini-int4wo |
|----------------------------------|----------------|------------------|
| **Popular aggregated benchmark** |                |                  |
| **Reasoning**                    |                |                  |
| HellaSwag                        | 54.57          | 53.54            |
| **Multilingual**                 |                |                  |
| **Math**                         |                |                  |
| **Overall**                      | **TODO**       | **TODO**         |
# Model Performance
Our int4 weight-only checkpoint is only optimized for batch size 1, so we only benchmark batch size 1 performance with vLLM.
For batch size N, please see our [gemlite checkpoint](https://huggingface.co/jerryzh168/phi4-mini-int4wo-gemlite).
## Download vllm source code and install vllm
```
git clone [email protected]:vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .
```
## Download dataset
Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
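A quick check that the download succeeded, assuming the file is a top-level JSON list of conversations (a minimal sketch):
```
import json

# Assumes ShareGPT_V3_unfiltered_cleaned_split.json is a top-level JSON list
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)
print(f"Loaded {len(data)} conversations")
```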
## benchmark_latency
Run the following under `vllm` source code root folder:
### baseline
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```
### int4wo-hqq
```
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/phi4-mini-int4wo-hqq --batch-size 1
```
## benchmark_serving
We also benchmarked the throughput in a serving environment.
Run the following under `vllm` source code root folder:
### baseline
Server:
```
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```
Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```
### int4wo-hqq
Server:
```
vllm serve jerryzh168/phi4-mini-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model jerryzh168/phi4-mini-int4wo-hqq --num-prompts 1
```
# Serving with vllm
We can use the same command as in the serving benchmarks to serve the model with vLLM:
```
vllm serve jerryzh168/phi4-mini-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
``` |
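Once the server is running, it exposes an OpenAI-compatible API (on port 8000 by default); a minimal client sketch using `requests`, assuming the default host and port:
```
import requests

# vLLM's OpenAI-compatible chat completions endpoint; assumes default host/port
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "jerryzh168/phi4-mini-int4wo-hqq",
        "messages": [{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```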