piero2c committed
Commit d8e605e · verified · 1 parent: c61ee78

Update README.md


* Update table with latest numbers and newer models
* Add Llama-3.3-70B MATH and HumanEval footnote

Files changed (1)
  1. README.md +9 -10
README.md CHANGED
@@ -97,17 +97,16 @@ To understand the capabilities, we compare `phi-4` with a set of models over Ope

  At the high-level overview of the model quality on representative benchmarks. For the table below, higher numbers indicate better performance:

- | **Category** | **Benchmark** | **Phi-4** | **GPT-4o-0806** | **GPT-4-Turbo-0409** | **GPT-4o-mini-0718** | **Llama-3.1-405B** | **Llama-3.1-70B** | **Phi-3.5-MoE** |
+ | **Category** | **Benchmark** | **phi-4** (14B) | **phi-3** (14B) | **Qwen 2.5** (14B instruct) | **GPT-4o-mini** | **Llama-3.3** (70B instruct) | **Qwen 2.5** (72B instruct) | **GPT-4o** |
  |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
- | Popular Aggregated Benchmark | MMLU | 85.2 | 88.0 | 86.7 | 80.9 | 86.4 | 80.5 | 79.6 |
- | Reasoning | DROP | 83.3 | 70.9 | 75.2 | 71.6 | 83.2 | 79.3 | 52.3 |
- | Science | GPQA | 58.1 | 52.5 | 50.0 | 37.4 | 50.5 | 45.0 | 35.4 |
- | Math | MGSM<br>MATH | 80.1<br>80.0 | 90.0<br>74.6 | 89.6<br>73.4 | 86.0<br>72.2 | 78.9<br>46.8 | 72.8<br>53.0 | 58.7<br>60.0 |
- | Factual Knowledge | SimpleQA | 7.1 | 39.1 | 24.2 | 9.3 | 17.7 | 12.5 | 6.3 |
- | Code Generation | HumanEval | 85.6 | 91.7 | 88.2 | 86.1 | 82.0 | 75.0 | 62.8 |
- | | **Average** | **68.5** | **72.4** | **69.6** | **63.4** | **63.6** | **59.7** | **50.7** |
-
- Overall, `phi-4` with only 14B parameters achieves a similar level of science and math capabilities as much larger models. Moreover, the model outperforms bigger models in reasoning capability, performing similarly to Llama-3.1-405B. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much world knowledge, which can be seen for example with low performance on SimpleQA. However, we believe such weakness can be resolved by augmenting `phi-4` with a search engine.
+ | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | **88.1** |
+ | Science | GPQA | **56.1** | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
+ | Math | MGSM<br>MATH | 80.6<br>**80.4** | 53.5<br>44.6 | 79.6<br>75.6 | 86.5<br>73.0 | 89.1<br>66.3* | 87.3<br>80.0 | **90.4**<br>74.6 |
+ | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | **90.6** |
+ | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | **39.4** |
+ | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | **90.2** | 76.7 | 80.9 |
+
+ \* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.

  ## Usage
