---
library_name: transformers
license: mit
tags:
- torchao
---
# Quantization Recipe

We used the following code to get the quantized model:
```py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
)
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"

# Load the baseline model
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8-bit dynamic activations + 4-bit weights (8dq4w),
# symmetric weight quantization with a group size of 32
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)
quantize_(
    model,
    linear_config,
)

# Save the quantized checkpoint
state_dict = model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```
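To reload the saved checkpoint later, one option is to rebuild the quantized module structure first and then load the tensors into it. This is a minimal sketch, not part of the original recipe; it assumes the same torchao version on both sides and a PyTorch release that supports `load_state_dict(assign=True)`:

```py
# Reload sketch (an assumption, not part of the original recipe):
# re-apply the same quantize_ config to a fresh model so the module
# structure matches the checkpoint, then load the saved tensors.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct", torch_dtype="auto", device_map="auto"
)
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        weight_granularity=PerGroup(32),
        weight_mapping_type=MappingType.SYMMETRIC,
    ),
)

# weights_only=False because the checkpoint stores torchao tensor subclasses
state_dict = torch.load("phi4-mini-8dq4w.pt", weights_only=False)
model.load_state_dict(state_dict, assign=True)
```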
# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
## Baseline

```sh
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
## 8dq4w

```py
import lm_eval
from lm_eval import evaluator
from lm_eval.utils import (
    make_table,
)

# `model` is the model after calling quantize_ as we do in the recipe above:
# quantize_(
#     model,
#     linear_config,
# )
lm_eval_model = lm_eval.models.huggingface.HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```
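As a quick sanity check before running the harness, you can generate a short completion with the quantized `model`. A small sketch, assuming the tokenizer of the base model and its chat template:

```py
# Sanity-check generation with the quantized model (not part of the
# original evaluation flow; prompt and max_new_tokens are arbitrary).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")
messages = [{"role": "user", "content": "What is in a california roll?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```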
| Benchmark | Phi-4 mini-Ins | phi4-mini-8dq4w |
|---|---|---|
| **Popular aggregated benchmark** | | |
| **Reasoning** | | |
| HellaSwag | 54.57 | 53.19 |
| **Multilingual** | | |
| **Math** | | |
| **Overall** | TODO | TODO |
# Exporting to ExecuTorch

Exporting to ExecuTorch requires you to clone and install [ExecuTorch](https://github.com/pytorch/executorch).
## Convert the quantized checkpoint to ExecuTorch's format

```sh
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```
## Export to an ExecuTorch *.pte with XNNPACK

```sh
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```
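Here `-kv` enables the KV cache and `-X` delegates supported operators to the XNNPACK backend (the short flag names may vary across ExecuTorch versions). Note that the `PARAMS` variable set here is reused by the runner command below, so run both steps in the same shell.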
## Run model with pybindings

```sh
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```
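With `--temperature 0`, decoding is effectively greedy, so the output should be deterministic for a given prompt, which makes it easy to compare runs of the exported model.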