---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---
This repository contains the Phi-4-mini-instruct model quantized by the PyTorch team with torchao, using float8 dynamic activation and float8 weight quantization at per-row granularity.
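For intuition, per-row float8 quantization gives each weight row its own scale so that the row's largest magnitude maps onto the float8 e4m3 range (±448). The snippet below is a rough illustrative sketch of that idea with a hypothetical helper name, not torchao's actual kernel:

```python
import torch

def per_row_float8_quantize(w: torch.Tensor):
    """Illustrative per-row float8 (e4m3) weight quantization sketch."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    # One scale per row, chosen so the row's max |value| lands at fp8_max.
    scale = w.abs().amax(dim=1, keepdim=True) / fp8_max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4, 8, dtype=torch.bfloat16)
w_fp8, scale = per_row_float8_quantize(w)
# Dequantize for comparison: w ≈ w_fp8.to(torch.bfloat16) * scale
print((w - w_fp8.to(torch.bfloat16) * scale).abs().max())
```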
# Quantization Recipe

First, install the required packages:

```shell
pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
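As an optional sanity check (a small sketch, assuming the nightly installs succeeded), you can confirm the packages are importable and print their versions:

```python
import torch
import torchao
import transformers

# Quick check that the nightly torchao build and transformers are importable.
print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```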
We used the following code to get the quantized model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "microsoft/Phi-4-mini-instruct"

# Float8 dynamic activation + float8 weight quantization, per-row granularity
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)

inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
```
# Serving with vllm

We can use the same command as in the serving benchmarks below to serve the model with vLLM:

```shell
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
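Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the default host and port (`localhost:8000`) and the `openai` Python package:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is unused unless configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Write a one-line summary of float8 quantization."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```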
# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## baseline

```shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## float8dq

```shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
```
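The same evaluation can also be driven from Python. A sketch using lm-evaluation-harness's `simple_evaluate` entry point, mirroring the task and batch size of the command above:

```python
import lm_eval

# Evaluate the quantized checkpoint on hellaswag, mirroring the CLI invocation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])
```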
| Benchmark                        | Phi-4 mini-Ins | phi4-mini-float8dq |
|----------------------------------|----------------|--------------------|
| **Popular aggregated benchmark** |                |                    |
| mmlu (0-shot)                    | x              |                    |
| mmlu_pro (5-shot)                | x              |                    |
| **Reasoning**                    |                |                    |
| arc_challenge (0-shot)           | 56.91          | x                  |
| gpqa_main_zeroshot               | 30.13          | x                  |
| HellaSwag                        | 54.57          | 54.55              |
| openbookqa                       | 33.00          | x                  |
| piqa (0-shot)                    | 77.64          | x                  |
| social_iqa                       | 49.59          | x                  |
| truthfulqa_mc2 (0-shot)          | 48.39          | x                  |
| winogrande (0-shot)              | 71.11          | x                  |
| **Multilingual**                 |                |                    |
| mgsm_en_cot_en                   | 60.8           | 60.0               |
| **Math**                         |                |                    |
| gsm8k (5-shot)                   | 81.88          | 80.89              |
| mathqa (0-shot)                  | 42.31          | 42.51              |
| **Overall**                      | TODO           | TODO               |
# Model Performance

## Results (H100 machine)

| Benchmark                  | Phi-4 mini-Ins | phi4-mini-float8dq        |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 1.64s          | 1.41s (16% speedup)       |
| latency (batch_size=128)   | 3.1s           | 2.72s (14% speedup)       |
| serving (num_prompts=1)    | 1.35 req/s     | 1.57 req/s (16% speedup)  |
| serving (num_prompts=1000) | 66.68 req/s    | 80.53 req/s (21% speedup) |

Note: latency results (from benchmark_latency) are in seconds, and serving results (from benchmark_serving) are in requests per second.
## Download dataset

Download the sharegpt dataset:

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
## benchmark_latency

Run the following under the vllm source code root folder:

### baseline

```shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### float8dq

```shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
```
## benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following under the vllm source code root folder:

### baseline

Server:

```shell
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:

```shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### float8dq

Server:

```shell
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:

```shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
```