# Phi-4 Model Card

## Model Summary
| | |
|---|---|
| **Developers** | Microsoft Research |
| **Description** | phi-4 is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public-domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small, capable models were trained with data focused on high quality and advanced reasoning. phi-4 underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. |
| **Architecture** | 14B parameters, dense decoder-only Transformer model |
| **Inputs** | Text, best suited for prompts in the chat format |
| **Context length** | 16K tokens |
| **GPUs** | 1920 H100-80G |
| **Training time** | 21 days |
| **Training data** | 9.8T tokens |
| **Outputs** | Generated text in response to input |
| **Dates** | October 2024 – November 2024 |
| **Status** | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data |
| **Release date** | December 12, 2024 |
| **License** | MIT |
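The summary above notes that phi-4 takes text input and is best suited for chat-format prompts. Below is a minimal, illustrative sketch of chat-format inference with the Hugging Face `transformers` library; the repository id `microsoft/phi-4`, the bfloat16/single-GPU setup, and the generation settings are assumptions made for the example, not specifications from this card.

```python
# Minimal sketch: chat-format inference with Hugging Face transformers.
# Assumptions (not stated in this card): the model is published as
# "microsoft/phi-4" and fits on a single GPU in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# phi-4 is best suited for prompts in the chat format, so build the prompt
# with the tokenizer's chat template rather than raw concatenated text.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the Pythagorean theorem in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0, inputs.shape[-1]:], skip_special_tokens=True))
```

Using `apply_chat_template` rather than a hand-built prompt keeps the role formatting and special tokens consistent with the chat format the model was aligned on.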
## Intended Use
| | |
|---|---|
| **Primary Use Cases** | Our model is designed to accelerate research on language models and to serve as a building block for generative AI-powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require: (1) memory/compute-constrained environments, (2) latency-bound scenarios, and (3) reasoning and logic. |
| **Out-of-Scope Use Cases** | Developers should evaluate and mitigate accuracy, safety, and fairness concerns before using the model in high-risk scenarios, and should ensure compliance with applicable laws and regulations (including privacy and trade compliance laws). |
## Data Overview

### Training Datasets
Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources:

1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code.
2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.).
3. Acquired academic books and Q&A datasets.
4. High-quality chat-format supervised data covering various topics to reflect human preferences on aspects such as instruction-following, truthfulness, honesty, and helpfulness.
Multilingual data constitutes about 8% of our overall data. We focus on data quality that could potentially improve the model's reasoning ability, and we filter the publicly available documents to contain the correct level of knowledge.
#### Benchmark datasets
We evaluated phi-4 using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities. More specifically, we used the following benchmarks (an illustrative scoring sketch appears after the list):
- **MMLU:** Popular aggregated dataset for multitask language understanding.
- **MATH:** Challenging competition math problems.
- **GPQA:** Complex, graduate-level science questions.
- **DROP:** Complex comprehension and reasoning.
- **MGSM:** Multilingual grade-school math.
- **HumanEval:** Functional code generation.
- **SimpleQA:** Factual responses.
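These benchmarks use different metrics (multiple-choice accuracy, exact match, pass@1 for code), but the core evaluation loop is the same: format each example into a chat prompt, generate a completion, extract an answer, and compare it against the reference. The sketch below shows such a loop for an MMLU-style multiple-choice task; `format_mc_prompt`, `extract_choice`, `generate_answer`, and the toy example record are hypothetical illustrations, not part of OpenAI's SimpleEval or Microsoft's internal harness.

```python
# Illustrative sketch of an MMLU-style multiple-choice scoring loop.
# `generate_answer` is a hypothetical stand-in for a call to the model
# (e.g. the chat-format generation shown earlier in this card).
import re
from typing import Callable

def format_mc_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its answer options as a single user message."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def extract_choice(text: str) -> str | None:
    """Pull the first standalone A-D letter out of the model's reply."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None

def mc_accuracy(examples: list[dict], generate_answer: Callable[[str], str]) -> float:
    """Fraction of examples whose extracted letter matches the gold letter."""
    correct = 0
    for ex in examples:
        reply = generate_answer(format_mc_prompt(ex["question"], ex["choices"]))
        correct += int(extract_choice(reply) == ex["answer"])
    return correct / len(examples)

# Tiny smoke test with a dummy "model" that always answers B.
examples = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
]
print(mc_accuracy(examples, lambda prompt: "The answer is B."))  # 1.0
```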
## Safety

### Approach
phi-4 has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), using publicly available datasets focused on helpfulness and harmlessness as well as various questions and answers targeted at multiple safety categories.
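For reference, the standard DPO objective (Rafailov et al., 2023) optimizes the policy directly against preference pairs. The card does not state which DPO variant or hyperparameters were used, so the formula below is the generic published form, not a description of phi-4's exact recipe:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) model, and $\beta$ controls the strength of the implicit regularization toward the reference model.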
### Safety Evaluation and Red-Teaming
Prior to release, phi-4 followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by phi-4 in both average and adversarial user scenarios.