gugarosa piero2c committed
Commit d85a24e · verified · 1 Parent(s): 9936ea0

Update README.md (#2)

- Update README.md (d8e605e17777f66f54e760a030eac6b42287fb10)


Co-authored-by: Piero Kauffmann <[email protected]>

Files changed (1)
  1. README.md +9 -10
README.md CHANGED
@@ -100,17 +100,16 @@ To understand the capabilities, we compare `phi-4` with a set of models over Ope
 
 At a high level, the table below gives an overview of the model quality on representative benchmarks; higher numbers indicate better performance:
 
- | **Category** | **Benchmark** | **Phi-4** | **GPT-4o-0806** | **GPT-4-Turbo-0409** | **GPT-4o-mini-0718** | **Llama-3.1-405B** | **Llama-3.1-70B** | **Phi-3.5-MoE** |
+ | **Category** | **Benchmark** | **phi-4** (14B) | **phi-3** (14B) | **Qwen 2.5** (14B instruct) | **GPT-4o-mini** | **Llama-3.3** (70B instruct) | **Qwen 2.5** (72B instruct) | **GPT-4o** |
 |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------|
- | Popular Aggregated Benchmark | MMLU | 85.2 | 88.0 | 86.7 | 80.9 | 86.4 | 80.5 | 79.6 |
- | Reasoning | DROP | 83.3 | 70.9 | 75.2 | 71.6 | 83.2 | 79.3 | 52.3 |
- | Science | GPQA | 58.1 | 52.5 | 50.0 | 37.4 | 50.5 | 45.0 | 35.4 |
- | Math | MGSM<br>MATH | 80.1<br>80.0 | 90.0<br>74.6 | 89.6<br>73.4 | 86.0<br>72.2 | 78.9<br>46.8 | 72.8<br>53.0 | 58.7<br>60.0 |
- | Factual Knowledge | SimpleQA | 7.1 | 39.1 | 24.2 | 9.3 | 17.7 | 12.5 | 6.3 |
- | Code Generation | HumanEval | 85.6 | 91.7 | 88.2 | 86.1 | 82.0 | 75.0 | 62.8 |
- | | **Average** | **68.5** | **72.4** | **69.6** | **63.4** | **63.6** | **59.7** | **50.7** |
-
- Overall, `phi-4` with only 14B parameters achieves a similar level of science and math capability as much larger models. Moreover, the model outperforms bigger models in reasoning, performing similarly to Llama-3.1-405B. However, it is still fundamentally limited by its size for certain tasks: the model simply does not have the capacity to store much world knowledge, as seen, for example, in its low performance on SimpleQA. We believe this weakness can be resolved by augmenting `phi-4` with a search engine.
+ | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | **88.1** |
+ | Science | GPQA | **56.1** | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
+ | Math | MGSM<br>MATH | 80.6<br>**80.4** | 53.5<br>44.6 | 79.6<br>75.6 | 86.5<br>73.0 | 89.1<br>66.3* | 87.3<br>80.0 | **90.4**<br>74.6 |
+ | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9* | 80.4 | **90.6** |
+ | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | **39.4** |
+ | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | **90.2** | 76.7 | 80.9 |
+
+ \* These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B.
 
 ## Usage
 
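
The hunk's trailing context points at the README's `## Usage` section, whose contents are not part of this diff. As a minimal, hypothetical sketch of what using `phi-4` with Hugging Face Transformers can look like (the `microsoft/phi-4` checkpoint ID, bf16 weights, and the chat-style prompt below are assumptions for illustration, not the README's own snippet):

```python
# Minimal sketch (not the README's own snippet): load the model with Hugging Face
# Transformers and run a short chat-style generation. The "microsoft/phi-4" model ID
# and the generation settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 14B weights at roughly 28 GB
    device_map="auto",           # requires `accelerate`; places layers on available devices
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what the GPQA benchmark measures in one sentence."},
]

# Build the prompt with the model's chat template, then generate a reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (everything after the prompt).
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```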