piero2c committed
Commit d8e605e · verified · 1 parent: c61ee78

Update README.md


* Update table with latest numbers and newer models
* Add Llama-3.3-70B MATH and HumanEval footnote

Files changed (1)
  1. README.md +9 -10
README.md CHANGED
@@ -97,17 +97,16 @@ To understand the capabilities, we compare `phi-4` with a set of models over Ope

  At the high-level overview of the model quality on representative benchmarks. For the table below, higher numbers indicate better performance:

- | **Category** | **Benchmark** | **Phi-4** | **GPT-4o-0806** | **GPT-4-Turbo-0409** | **GPT-4o-mini-0718** | **Llama-3.1-405B** | **Llama-3.1-70B** | **Phi-3.5-MoE** |
+ | **Category** | **Benchmark** | **phi-4** (14B) | **phi-3** (14B) | **Qwen 2.5** (14B instruct) | **GPT-4o-mini** | **Llama-3.3** (70B instruct) | **Qwen 2.5** (72B instruct) | **GPT-4o** |
  |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
- | Popular Aggregated Benchmark | MMLU | 85.2 | 88.0 | 86.7 | 80.9 | 86.4 | 80.5 | 79.6 |
- | Reasoning | DROP | 83.3 | 70.9 | 75.2 | 71.6 | 83.2 | 79.3 | 52.3 |
- | Science | GPQA | 58.1 | 52.5 | 50.0 | 37.4 | 50.5 | 45.0 | 35.4 |
- | Math | MGSM<br>MATH | 80.1<br>80.0 | 90.0<br>74.6 | 89.6<br>73.4 | 86.0<br>72.2 | 78.9<br>46.8 | 72.8<br>53.0 | 58.7<br>60.0 |
- | Factual Knowledge | SimpleQA | 7.1 | 39.1 | 24.2 | 9.3 | 17.7 | 12.5 | 6.3 |
- | Code Generation | HumanEval | 85.6 | 91.7 | 88.2 | 86.1 | 82.0 | 75.0 | 62.8 |
- | | **Average** | **68.5** | **72.4** | **69.6** | **63.4** | **63.6** | **59.7** | **50.7** |
-
- Overall, `phi-4` with only 14B parameters achieves a similar level of science and math capabilities as much larger models. Moreover, the model outperforms bigger models in reasoning capability, performing similarly to Llama-3.1-405B. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much world knowledge, which can be seen for example with low performance on SimpleQA. However, we believe such weakness can be resolved by augmenting `phi-4` with a search engine.
+ | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | **88.1** |
+ | Science | GPQA | **56.1** | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
+ | Math | MGSM<br>MATH | 80.6<br>**80.4** | 53.5<br>44.6 | 79.6<br>75.6 | 86.5<br>73.0 | 89.1<br>66.3* | 87.3<br>80.0 | **90.4**<br>74.6 |
+ | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | **90.6** |
+ | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | **39.4** |
+ | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | **90.2** | 76.7 | 80.9 |
+
+ \* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.

  ## Usage
