Phi-4 Model Card

Phi-4 Technical Report

Model Summary

Developers: Microsoft Research
Description: phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
Architecture: 14B parameters, dense decoder-only Transformer model
Inputs: Text, best suited for prompts in the chat format
Context length: 16K tokens
GPUs: 1920 H100-80G
Training time: 21 days
Training data: 9.8T tokens
Outputs: Generated text in response to input
Dates: October 2024 – November 2024
Status: Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data
Release date: December 12, 2024
License: MIT
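
As a rough sanity check on the summary figures, the listed GPU count, training time, and token count imply an aggregate compute budget and an average throughput. The sketch below assumes that all 1920 GPUs ran continuously for the full 21 days and that 9.8T is the number of tokens processed in that window. Neither assumption is stated in the card, so the results are order-of-magnitude estimates only.

```python
# Back-of-the-envelope estimate from the model summary figures.
# Assumptions (not stated in the card): every GPU ran for the full
# 21 days, and 9.8T is the count of tokens processed in that time.

NUM_GPUS = 1920          # H100-80G GPUs, from the summary
TRAINING_DAYS = 21       # training time, from the summary
TRAINING_TOKENS = 9.8e12 # training data, from the summary


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours if all GPUs run for the whole period."""
    return num_gpus * days * 24


def tokens_per_gpu_second(tokens: float, num_gpus: int, days: float) -> float:
    """Average tokens processed per GPU per second under the same assumption."""
    return tokens / (gpu_hours(num_gpus, days) * 3600)


if __name__ == "__main__":
    hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
    rate = tokens_per_gpu_second(TRAINING_TOKENS, NUM_GPUS, TRAINING_DAYS)
    print(f"GPU-hours: {hours:,.0f}")
    print(f"Tokens per GPU-second: {rate:,.0f}")
```

Under these assumptions the run comes to a little under one million GPU-hours. Real utilization is typically lower than 100%, so the per-GPU throughput during active training would be higher than this average.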

Intended Use

Primary Use Cases: Our model is designed to accelerate research on language models and to serve as a building block for generative AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:

1. Memory/compute-constrained environments.
2. Latency-bound scenarios.
3. Reasoning and logic.

Out-of-Scope Use Cases: Developers should evaluate and mitigate accuracy, safety, and fairness concerns before using the model for high-risk scenarios. Ensure compliance with applicable laws and regulations (including privacy, trade compliance laws, etc.).

Data Overview

Training Datasets

Our training data is an extension of the data used for Phi-3 and draws on a wide variety of sources:

  1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.

  2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).

  3. Acquired academic books and Q&A datasets.

  4. High-quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruction following, truthfulness, honesty, and helpfulness.

Multilingual data constitutes about 8% of our overall data. We focus on the quality of data that could improve the reasoning ability of the model, and we filter the publicly available documents so that they contain an appropriate level of knowledge.

Benchmark datasets

We evaluated phi-4 using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities. Specifically:

  • MMLU: Popular aggregated dataset for multitask language understanding.

  • MATH: Challenging competition math problems.

  • GPQA: Complex, graduate-level science questions.

  • DROP: Complex comprehension and reasoning.

  • MGSM: Multi-lingual grade-school math.

  • HumanEval: Functional code generation.

  • SimpleQA: Factual responses.

Safety

Approach

phi-4 has adopted a robust safety post-training approach that leverages a variety of both open-source and in-house generated synthetic datasets. Safety alignment combines SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), using publicly available datasets focused on helpfulness and harmlessness as well as questions and answers targeted at multiple safety categories.
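
To make the DPO step more concrete, here is a minimal sketch of the standard per-pair DPO loss as it is commonly formulated. It is not the phi-4 training code: the function name `dpo_loss`, its parameters, and the value `beta=0.1` are illustrative assumptions, and the sketch covers only the loss for one preference pair, not the iterative procedure or the datasets used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the card
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifts probability toward the chosen response: loss drops.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, which is what lets preference data steer the model without training a separate reward model.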

Safety Evaluation and Red-Teaming

Prior to release, phi-4 followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the
