realolipop committed
Commit a9c823a · verified · 1 Parent(s): 7248a14

Update README.md

Files changed (1)
  1. README.md +20 -20
README.md CHANGED
@@ -97,26 +97,26 @@ foundation for next-generation language model agents to reason and tackle real-w
 
 **Performance of MiniMax-M1 on core benchmarks.**
 
- | **Category** | **Task** | **MiniMax-M1-40K** | **MiniMax-M1-80K** | **OpenAI-o3** | **Gemini 2.5 Pro (06-05)** | **Claude 4 Opus** | **Seed-Thinking-v1.5** | **DeepSeek-R1** | **DeepSeek-R1-0528** | **Qwen3-235B-A22B** |
- |:---|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- | | *Extended Thinking* | *40K* | *80K* | *100k* | *64k* | *64k* | *32k* | *32k* | *64k* | *32k* |
- | ***Mathematics*** | AIME 2024 | 83.3 | 86.0 | 91.6 | 92.0 | 76.0 | 86.7 | 79.8 | 91.4 | 85.7 |
- | | AIME 2025 | 74.6 | 76.9 | 88.9 | 88.0 | 75.5 | 74.0 | 70.0 | 87.5 | 81.5 |
- | | MATH-500 | 96.0 | 96.8 | 98.1 | 98.8 | 98.2 | 96.7 | 97.3 | 98.0 | 96.2 |
- | ***General Coding*** | LiveCodeBench *(24/8~25/5)* | 62.3 | 65.0 | 75.8 | 77.1 | 56.6 | 67.5 | 55.9 | 73.1 | 65.9 |
- | | FullStackBench | 67.6 | 68.3 | 69.3 | -- | 70.3 | 69.9 | 70.1 | 69.4 | 62.9 |
- | ***Reasoning & Knowledge***| GPQA Diamond | 69.2 | 70.0 | 83.3 | 86.4 | 79.6 | 77.3 | 71.5 | 81.0 | 71.1 |
- | | HLE *(no tools)* | 7.2\* | 8.4\* | 20.3 | 21.6 | 10.7 | 8.2 | 8.6\* | 17.7\* | 7.6\* |
- | | ZebraLogic | 80.1 | 86.8 | 95.8 | 91.6 | 95.1 | 84.4 | 78.7 | 95.1 | 80.3 |
- | | MMLU-Pro | 80.6 | 81.1 | 85.0 | 86.0 | 85.0 | 87.0 | 84.0 | 85.0 | 83.0 |
- | ***Software Engineering***| SWE-bench Verified| 55.6 | 56.0 | 69.1 | 67.2 | 72.5 | 47.0 | 49.2 | 57.6 | 34.4 |
- | ***Long Context*** | OpenAI-MRCR *(128k)* | 76.1 | 73.4 | 56.5 | 76.8 | 48.9 | 54.3 | 35.8 | 51.5 | 27.7 |
- | | OpenAI-MRCR *(1M)* | 58.6 | 56.2 | -- | 58.8 | -- | -- | -- | -- | -- |
- | | LongBench-v2 | 61.0 | 61.5 | 58.8 | 65.0 | 55.6 | 52.5 | 58.3 | 52.1 | 50.1 |
- | ***Agentic Tool Use***| TAU-bench *(airline)* | 60.0 | 62.0 | 52.0 | 50.0 | 59.6 | 44.0 | -- | 53.5 | 34.7 |
- | | TAU-bench *(retail)* | 67.8 | 63.5 | 73.9 | 67.0 | 81.4 | 55.7 | -- | 63.9 | 58.6 |
- | ***Factuality*** | SimpleQA | 17.9 | 18.5 | 49.4 | 54.0 | -- | 12.9 | 30.1 | 27.8 | 11.0 |
- | ***General Assistant***| MultiChallenge | 44.7 | 44.7 | 56.5 | 51.8 | 45.8 | 43.0 | 40.7 | 45.0 | 40.0 |
+ | **MiniMax-M1-80K** | **MiniMax-M1-40K** | **Qwen3-235B-A22B** | **DeepSeek-R1-0528** | **DeepSeek-R1** | **Seed-Thinking-v1.5** | **Claude 4 Opus** | **Gemini 2.5 Pro (06-05)** | **OpenAI-o3** | **Task** | **Category** |
+ |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---|:---|
+ | *80K* | *40K* | *32k* | *64k* | *32k* | *32k* | *64k* | *64k* | *100k* | *Extended Thinking* | |
+ | 86.0 | 83.3 | 85.7 | 91.4 | 79.8 | 86.7 | 76.0 | 92.0 | 91.6 | AIME 2024 | ***Mathematics*** |
+ | 76.9 | 74.6 | 81.5 | 87.5 | 70.0 | 74.0 | 75.5 | 88.0 | 88.9 | AIME 2025 | |
+ | 96.8 | 96.0 | 96.2 | 98.0 | 97.3 | 96.7 | 98.2 | 98.8 | 98.1 | MATH-500 | |
+ | 65.0 | 62.3 | 65.9 | 73.1 | 55.9 | 67.5 | 56.6 | 77.1 | 75.8 | LiveCodeBench *(24/8~25/5)* | ***General Coding*** |
+ | 68.3 | 67.6 | 62.9 | 69.4 | 70.1 | 69.9 | 70.3 | -- | 69.3 | FullStackBench | |
+ | 70.0 | 69.2 | 71.1 | 81.0 | 71.5 | 77.3 | 79.6 | 86.4 | 83.3 | GPQA Diamond | ***Reasoning & Knowledge***|
+ | 8.4\* | 7.2\* | 7.6\* | 17.7\* | 8.6\* | 8.2 | 10.7 | 21.6 | 20.3 | HLE *(no tools)* | |
+ | 86.8 | 80.1 | 80.3 | 95.1 | 78.7 | 84.4 | 95.1 | 91.6 | 95.8 | ZebraLogic | |
+ | 81.1 | 80.6 | 83.0 | 85.0 | 84.0 | 87.0 | 85.0 | 86.0 | 85.0 | MMLU-Pro | |
+ | 56.0 | 55.6 | 34.4 | 57.6 | 49.2 | 47.0 | 72.5 | 67.2 | 69.1 | SWE-bench Verified| ***Software Engineering***|
+ | 73.4 | 76.1 | 27.7 | 51.5 | 35.8 | 54.3 | 48.9 | 76.8 | 56.5 | OpenAI-MRCR *(128k)* | ***Long Context*** |
+ | 56.2 | 58.6 | -- | -- | -- | -- | -- | 58.8 | -- | OpenAI-MRCR *(1M)* | |
+ | 61.5 | 61.0 | 50.1 | 52.1 | 58.3 | 52.5 | 55.6 | 65.0 | 58.8 | LongBench-v2 | |
+ | 62.0 | 60.0 | 34.7 | 53.5 | -- | 44.0 | 59.6 | 50.0 | 52.0 | TAU-bench *(airline)* | ***Agentic Tool Use***|
+ | 63.5 | 67.8 | 58.6 | 63.9 | -- | 55.7 | 81.4 | 67.0 | 73.9 | TAU-bench *(retail)* | |
+ | 18.5 | 17.9 | 11.0 | 27.8 | 30.1 | 12.9 | -- | 54.0 | 49.4 | SimpleQA | ***Factuality*** |
+ | 44.7 | 44.7 | 40.0 | 45.0 | 40.7 | 43.0 | 45.8 | 51.8 | 56.5 | MultiChallenge | ***General Assistant***|
 
 \* conducted on the text-only HLE subset.
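
The change leaves every benchmark value untouched and only permutes the table's columns: the MiniMax models move to the front, the remaining model columns are mirrored, and `Task`/`Category` move to the end. As a minimal sketch of that kind of mechanical reordering, the Python snippet below reorders the cells of a pipe-delimited Markdown row; the `permute_markdown_row` helper is hypothetical and not part of this repo.

```python
# Hypothetical helper (not part of this repo): applies a column
# permutation to one pipe-delimited Markdown table row, the kind of
# mechanical reordering this commit performs on README.md.

def permute_markdown_row(row: str, order: list[int]) -> str:
    """Reorder the cells of a `| a | b | c |` table row."""
    # Drop the outer pipes, split into cells, trim the padding spaces.
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    return "| " + " | ".join(cells[i] for i in order) + " |"

# Toy example: mirror a three-column row.
row = "| **Category** | **Task** | **MiniMax-M1-40K** |"
print(permute_markdown_row(row, [2, 1, 0]))
# -> | **MiniMax-M1-40K** | **Task** | **Category** |
```

Splitting on `|` is enough here because no cell in this table contains a literal pipe; a table with escaped pipes or pipes inside code spans would need a real Markdown parser instead.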