metadata
library_name: transformers
tags:
  - torchao
  - phi
  - phi4
  - nlp
  - code
  - math
  - chat
  - conversational
license: mit
language:
  - multilingual
base_model:
  - microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation

Phi4-mini model quantized with torchao float8 dynamic activation and float8 weight quantization (per-row granularity), by the PyTorch team.
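
To simply run the already-quantized checkpoint published from this recipe, it can be loaded directly from the Hub. A minimal sketch, assuming a CUDA GPU and the pytorch/Phi-4-mini-instruct-float8dq repo id used elsewhere in this card:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-quantized float8dq checkpoint (torchao must be installed, see below).
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is 2 + 2?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))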

Quantization Recipe

First, install the required packages:

pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
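
Optionally, a quick sanity check that the development builds were picked up (the exact version strings will vary):

import torchao, transformers
# Expect a torchao nightly (.dev) build and a transformers build installed from git.
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)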

We used the following code to produce the quantized model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Push to hub
USER_ID = "YOUR_USER_ID"
MODEL_NAME = model_id.split("/")[-1]
save_to = f"{USER_ID}/{MODEL_NAME}-float8dq"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(prompt):])
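
If you prefer not to push to the Hub, the quantized model can also be written to a local directory with the standard save_pretrained / from_pretrained API; a minimal sketch (the local path is just an example):

# Save the quantized model locally instead of pushing to the Hub.
output_dir = "Phi-4-mini-instruct-float8dq"  # example path
quantized_model.save_pretrained(output_dir, safe_serialization=False)
tokenizer.save_pretrained(output_dir)

# Reload it later for inference.
reloaded_model = AutoModelForCausalLM.from_pretrained(
    output_dir, device_map="auto", torch_dtype=torch.bfloat16
)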

Serving with vLLM

We can use the same command used in the serving benchmarks below to serve the model with vLLM:

vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
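
Once the server is up it exposes an OpenAI-compatible API, by default on port 8000. A minimal client sketch using the openai Python package; the base URL and placeholder API key below are the usual vLLM defaults:

from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the API key is ignored by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Hey, are you conscious? Can you talk to me?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)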

Model Quality

We rely on lm-evaluation-harness to evaluate the quality of the quantized model.

baseline

lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

float8dq

lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
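
The same evaluation can also be driven from Python through lm-evaluation-harness's simple_evaluate API; a rough sketch equivalent to the float8dq command above:

import lm_eval

# Python equivalent of the lm_eval CLI call for the quantized model.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
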
| Benchmark                    | Phi-4 mini-Ins | phi4-mini-float8dq |
|------------------------------|----------------|--------------------|
| Popular aggregated benchmark |                |                    |
| mmlu (0-shot)                |                | x                  |
| mmlu_pro (5-shot)            |                | x                  |
| Reasoning                    |                |                    |
| arc_challenge (0-shot)       | 56.91          | x                  |
| gpqa_main_zeroshot           | 30.13          | x                  |
| HellaSwag                    | 54.57          | 54.55              |
| openbookqa                   | 33.00          | x                  |
| piqa (0-shot)                | 77.64          | x                  |
| social_iqa                   | 49.59          | x                  |
| truthfulqa_mc2 (0-shot)      | 48.39          | x                  |
| winogrande (0-shot)          | 71.11          | x                  |
| Multilingual                 |                |                    |
| mgsm_en_cot_en               | 60.8           | 60.0               |
| Math                         |                |                    |
| gsm8k (5-shot)               | 81.88          | 80.89              |
| mathqa (0-shot)              | 42.31          | 42.51              |
| Overall                      | TODO           | TODO               |

Model Performance

Results (H100 machine)

| Benchmark                  | Phi-4 mini-Ins | phi4-mini-float8dq        |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 1.64s          | 1.41s (16% speedup)       |
| latency (batch_size=128)   | 3.1s           | 2.72s (14% speedup)       |
| serving (num_prompts=1)    | 1.35 req/s     | 1.57 req/s (16% speedup)  |
| serving (num_prompts=1000) | 66.68 req/s    | 80.53 req/s (21% speedup) |

Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
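
The speedup percentages in the table compare the quantized result against the baseline (lower latency or higher throughput is better); for example:

# How the speedup columns are derived.
baseline_latency, quantized_latency = 1.64, 1.41  # seconds, batch_size=1
print(f"{baseline_latency / quantized_latency - 1:.0%}")  # ~16% speedup

baseline_tput, quantized_tput = 66.68, 80.53  # req/s, num_prompts=1000
print(f"{quantized_tput / baseline_tput - 1:.0%}")  # ~21% speedup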

Download dataset

Download the ShareGPT dataset:

wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
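
If wget is not available, the same file can be fetched with the huggingface_hub Python package; a small sketch:

from huggingface_hub import hf_hub_download

# Download the ShareGPT json file from the Hub and print its local cache path.
path = hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
)
print(path)  # pass this path to --dataset-path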

Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks

benchmark_latency

Run the following from the vllm source code root folder:

baseline

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

float8dq

python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1

benchmark_serving

We also benchmarked the throughput in a serving environment.

Run the following from the vllm source code root folder:

baseline

Server:

vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

float8dq

Server:

vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3

Client:

python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1