Phi-4 Model Card

Model Summary


Developers	Microsoft Research
Description	`phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach is to train small, capable models on data focused on high quality and advanced reasoning. `phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
Architecture	14B parameters, dense decoder-only Transformer model
Inputs	Text, best suited for prompts in the chat format
Context length	16K tokens
GPUs	1920 H100-80G
Training time	21 days
Training data	9.8T tokens
Outputs	Generated text in response to input
Dates	October 2024 – November 2024
Status	Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data
Release date	December 12, 2024
License	MIT
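
The figures in the table above (GPU count, training time, token count, parameter count) can be combined into a rough training budget. The following is a minimal sketch, not an official calculation: it assumes all 1920 GPUs ran for the full 21 days, and it uses the common 6 * N * D approximation for training FLOPs, which does not come from this card.

```python
# Back-of-the-envelope training budget from the Model Summary table.
# Assumption: every GPU was busy for the whole 21 days.
# Assumption: training FLOPs ~= 6 * parameters * tokens (a common rule of thumb).

GPUS = 1920               # H100-80G GPUs
TRAINING_DAYS = 21
TRAINING_TOKENS = 9.8e12  # 9.8T tokens
PARAMETERS = 14e9         # 14B parameters


def training_budget(gpus, days, tokens, parameters):
    """Return derived quantities for a training run described by four numbers."""
    gpu_hours = gpus * days * 24
    gpu_seconds = gpu_hours * 3600
    return {
        "gpu_hours": gpu_hours,
        "tokens_per_gpu_per_second": tokens / gpu_seconds,
        "approx_training_flops": 6 * parameters * tokens,
        "tokens_per_parameter": tokens / parameters,
    }


budget = training_budget(GPUS, TRAINING_DAYS, TRAINING_TOKENS, PARAMETERS)
for name, value in budget.items():
    print(f"{name}: {value:.4g}")
```

Under these assumptions the run comes to 967,680 GPU-hours, an average of roughly 2,800 tokens per GPU per second, and about 700 training tokens per parameter. The throughput figure is an average over wall-clock time, so it understates peak throughput if any of the 21 days were spent on restarts or evaluation.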

Intended Use


Primary Use Cases	Our model is designed to accelerate research on language models and to serve as a building block for generative AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that involve: 1. Memory/compute-constrained environments. 2. Latency-bound scenarios. 3. Reasoning and logic.
Out-of-Scope Use Cases	Developers should evaluate and mitigate accuracy, safety, and fairness concerns before using the model for high-risk scenarios. Ensure compliance with applicable laws and regulations (including privacy, trade compliance laws, etc.).
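
For the memory-constrained environments mentioned above, a first question is whether the weights fit on the target device. The sketch below estimates weight storage for a 14B-parameter model at several precisions. The bytes-per-parameter values are generic assumptions about storage formats, not settings published in this card, and the estimate excludes the KV cache, activations, and any quantization metadata.

```python
# Rough weight-memory estimate for a dense 14B-parameter model.
# Assumption: bytes per parameter for each format are the usual nominal sizes.

PARAMETERS = 14e9  # from the Architecture row of the Model Summary

BYTES_PER_PARAMETER = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(parameters, bytes_per_parameter):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


footprints = {
    name: weight_memory_gb(PARAMETERS, size)
    for name, size in BYTES_PER_PARAMETER.items()
}

for name, gigabytes in footprints.items():
    print(f"{name}: {gigabytes:.0f} GB")
```

At 2 bytes per parameter the weights alone take about 28 GB, so actual serving memory will be higher once the cache for a context of up to 16K tokens is included.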

Data Overview

Training Datasets

Our training data is an extension of the data used for Phi-3 and draws on a wide variety of sources:

Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.
Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
Acquired academic books and Q&A datasets.
High-quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruction-following, truthfulness, honesty, and helpfulness.

Multilingual data constitutes about 8% of our overall data. We focus on data quality that could improve the model’s reasoning ability, and we filter the publicly available documents so that they contain an appropriate level of knowledge.

Benchmark Datasets

We evaluated phi-4 using OpenAI’s simple-evals and our own internal benchmarks to understand the model’s capabilities, specifically:

MMLU: Popular aggregated dataset for multitask language understanding.
MATH: Challenging competition math problems.
GPQA: Complex, graduate-level science questions.
DROP: Complex comprehension and reasoning.
MGSM: Multilingual grade-school math.
HumanEval: Functional code generation.
SimpleQA: Factual responses.
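
Functional code generation benchmarks such as HumanEval are commonly scored with pass@k: the probability that at least one of k sampled completions passes the unit tests. This card does not state which metric or sampling settings were used for phi-4, so the following is only an illustrative sketch of the standard unbiased estimator, with made-up sample counts.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass the tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples drawn, 3 of them pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

A benchmark score is then the mean of this value over all problems. With k = 1 the estimator reduces to the fraction of passing samples, c / n.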

Safety

Approach

phi-4 has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. Safety alignment combines SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), using publicly available datasets focused on helpfulness and harmlessness as well as questions and answers targeted at multiple safety categories.
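
To make the DPO step above more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written with plain Python floats. It is not the phi-4 training code: the function name, the example log-probabilities, and the value beta = 0.1 are all illustrative assumptions, and the iterative part (repeating rounds of preference data collection and optimization) is not shown.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta is an assumed value, not a hyperparameter from this card.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for one pair.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
print(f"loss = {loss:.4f}")
```

The loss equals log 2 when the policy and reference agree, falls as the policy raises the chosen response relative to the rejected one, and grows when it does the opposite.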

Safety Evaluation and Red-Teaming

Prior to release, phi-4 followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the