---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---
This repository contains the Phi-4-mini-instruct model quantized by the PyTorch team with torchao, using float8 dynamic activation and float8 weight quantization at per-row granularity.
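For intuition, per-row float8 quantization gives each weight row its own scale so that the row's largest magnitude maps onto the float8 e4m3 range (±448). The snippet below is a rough illustrative sketch of that idea with a hypothetical helper name, not torchao's actual kernel:

```python
import torch

def per_row_float8_quantize(w: torch.Tensor):
    """Illustrative per-row float8 (e4m3) weight quantization sketch."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    # One scale per row, chosen so the row's max |value| lands at fp8_max.
    scale = w.abs().amax(dim=1, keepdim=True) / fp8_max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4, 8, dtype=torch.bfloat16)
w_fp8, scale = per_row_float8_quantize(w)
# Dequantize for comparison: w ≈ w_fp8.to(torch.bfloat16) * scale
print((w - w_fp8.to(torch.bfloat16) * scale).abs().max())
```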
# Quantization Recipe

First, install the required packages:

```shell
pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
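As an optional sanity check (a small sketch, assuming the nightly installs succeeded), you can confirm the packages are importable and print their versions:

```python
import torch
import torchao
import transformers

# Quick check that the nightly torchao build and transformers are importable.
print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```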
We used the following code to get the quantized model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "microsoft/Phi-4-mini-instruct"

# Float8 dynamic activation + float8 weight quantization, per-row granularity
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)

inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
```
# Serving with vllm

We can use the same command as in the serving benchmarks below to serve the model with vLLM:

```shell
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
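Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch, assuming the default host and port (`localhost:8000`) and the `openai` Python package:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is unused unless configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Write a one-line summary of float8 quantization."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```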
# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## baseline

```shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## float8dq

```shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
```
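The same evaluation can also be driven from Python. A sketch using lm-evaluation-harness's `simple_evaluate` entry point, mirroring the task and batch size of the command above:

```python
import lm_eval

# Evaluate the quantized checkpoint on hellaswag, mirroring the CLI invocation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"]["hellaswag"])
```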
| Benchmark                        | Phi-4 mini-Ins | phi4-mini-float8dq |
|----------------------------------|----------------|--------------------|
| **Popular aggregated benchmark** |                |                    |
| mmlu (0-shot)                    | x              |                    |
| mmlu_pro (5-shot)                | x              |                    |
| **Reasoning**                    |                |                    |
| arc_challenge (0-shot)           | 56.91          | x                  |
| gpqa_main_zeroshot               | 30.13          | x                  |
| HellaSwag                        | 54.57          | 54.55              |
| openbookqa                       | 33.00          | x                  |
| piqa (0-shot)                    | 77.64          | x                  |
| social_iqa                       | 49.59          | x                  |
| truthfulqa_mc2 (0-shot)          | 48.39          | x                  |
| winogrande (0-shot)              | 71.11          | x                  |
| **Multilingual**                 |                |                    |
| mgsm_en_cot_en                   | 60.8           | 60.0               |
| **Math**                         |                |                    |
| gsm8k (5-shot)                   | 81.88          | 80.89              |
| mathqa (0-shot)                  | 42.31          | 42.51              |
| **Overall**                      | TODO           | TODO               |
# Model Performance

## Results (H100 machine)

| Benchmark                  | Phi-4 mini-Ins | phi4-mini-float8dq        |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 1.64s          | 1.41s (16% speedup)       |
| latency (batch_size=128)   | 3.1s           | 2.72s (14% speedup)       |
| serving (num_prompts=1)    | 1.35 req/s     | 1.57 req/s (16% speedup)  |
| serving (num_prompts=1000) | 66.68 req/s    | 80.53 req/s (21% speedup) |

Note: latency results (from benchmark_latency) are in seconds, and serving results (from benchmark_serving) are in requests per second.
## Download dataset

Download the sharegpt dataset:

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
## benchmark_latency

Run the following under the vllm source code root folder:

### baseline

```shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### float8dq

```shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
```
## benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following under the vllm source code root folder:

### baseline

Server:

```shell
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:

```shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### float8dq

Server:

```shell
vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:

```shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
```