---
pipeline_tag: text-generation
license: other
language:
- en
- zh
tags:
- math
base_model: internlm/internlm2-math-plus-1_8b
---

# InternLM-Math-Plus-GGUF
This is a quantized version of [internlm/internlm2-math-plus-1_8b](https://huggingface.co/internlm/internlm2-math-plus-1_8b), created using llama.cpp.
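
For reference, below is a minimal sketch of loading a GGUF file from this repository with [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). The `.gguf` file name and the prompt are placeholders; substitute the quantization you actually download, and adjust the parameters to your hardware.

```python
# Minimal sketch: run an InternLM2-Math-Plus GGUF checkpoint with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="internlm2-math-plus-1_8b.Q4_K_M.gguf",  # hypothetical file name; use the file you downloaded
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if available; set to 0 for CPU-only
)

# Chat-style generation; llama-cpp-python applies the chat template stored in the GGUF metadata.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful math assistant."},
        {"role": "user", "content": "Solve for x: 2x + 3 = 11."},
    ],
    max_tokens=256,
    temperature=0.0,
)
print(response["choices"][0]["message"]["content"])
```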

# Model Description
<div align="center">

<img src="https://raw.githubusercontent.com/InternLM/InternLM/main/assets/logo.svg" width="200"/>
  <div> </div>
  <div align="center">
    <b><font size="5">InternLM-Math</font></b>
    <sup>
      <a href="https://internlm.intern-ai.org.cn/">
        <i><font size="4">Plus</font></i>
      </a>
    </sup>
    <div> </div>
  </div>

State-of-the-art bilingual open-source math reasoning LLMs: a **solver**, **prover**, **verifier**, and **augmentor**.

</div>


# News
- [2024.05.24] We release the updated InternLM2-Math-Plus in four sizes (1.8B, 7B, 20B, and 8x22B) with state-of-the-art performance. We significantly improve informal math reasoning (chain-of-thought and code interpreter) and formal math reasoning (LEAN 4 translation and LEAN 4 theorem proving).
- [2024.02.10] We add technical reports and citation references.
- [2024.01.31] We add MiniF2F results with evaluation code!
- [2024.01.29] We add checkpoints on ModelScope and update results for majority voting and Code Interpreter. The technical report is on the way!
- [2024.01.26] We add checkpoints on OpenXLab, which makes downloading easier for users in China!

# Performance

## Formal Math Reasoning
We evaluate the performance of InternLM2-Math-Plus on the formal math reasoning benchmark MiniF2F-test. The evaluation setting is the same as Llemma's, using LEAN 4.
| Models                           | MiniF2F-test |
| -------------------------------- | ------------ |
| ReProver                         | 26.5         |
| LLMStep                          | 27.9         |
| GPT-F                            | 36.6         |
| HTPS                             | 41.0         |
| Llemma-7B                        | 26.2         |
| Llemma-34B                       | 25.8         |
| InternLM2-Math-7B-Base           | 30.3         |
| InternLM2-Math-20B-Base          | 29.5         |
| InternLM2-Math-Plus-1.8B         | 38.9         |
| InternLM2-Math-Plus-7B           | **43.4**     |
| InternLM2-Math-Plus-20B          | 42.6         |
| InternLM2-Math-Plus-Mixtral8x22B | 37.3         |

## Informal Math Reasoning
We evaluate the performance of InternLM2-Math-Plus on the informal math reasoning benchmarks MATH and GSM8K. InternLM2-Math-Plus-1.8B outperforms MiniCPM-2B in the smallest size setting. InternLM2-Math-Plus-7B outperforms Deepseek-Math-7B-RL, the state-of-the-art open-source math reasoning model. InternLM2-Math-Plus-Mixtral8x22B achieves 68.5 on MATH (with Python) and 91.8 on GSM8K.
| Model                            | MATH     | MATH-Python | GSM8K    |
| -------------------------------- | -------- | ----------- | -------- |
| MiniCPM-2B                       | 10.2     | -           | 53.8     |
| InternLM2-Math-Plus-1.8B         | **37.0** | **41.5**    | **58.8** |
| InternLM2-Math-7B                | 34.6     | 50.9        | 78.1     |
| Deepseek-Math-7B-RL              | 51.7     | 58.8        | **88.2** |
| InternLM2-Math-Plus-7B           | **53.0** | **59.7**    | 85.8     |
| InternLM2-Math-20B               | 37.7     | 54.3        | 82.6     |
| InternLM2-Math-Plus-20B          | **53.8** | **61.8**    | **87.7** |
| Mixtral8x22B-Instruct-v0.1       | 41.8     | -           | 78.6     |
| Eurux-8x22B-NCA                  | 49.0     | -           | -        |
| InternLM2-Math-Plus-Mixtral8x22B | **58.1** | **68.5**    | **91.8** |

We also evaluate models on [MathBench-A](https://github.com/open-compass/MathBench). InternLM2-Math-Plus-Mixtral8x22B performs comparably to Claude 3 Opus.
| Model                            | Arithmetic | Primary | Middle | High | College | Average |
| -------------------------------- | ---------- | ------- | ------ | ---- | ------- | ------- |
| GPT-4o-0513                      | 77.7       | 87.7    | 76.3   | 59.0 | 54.0    | 70.9    |
| Claude 3 Opus                    | 85.7       | 85.0    | 58.0   | 42.7 | 43.7    | 63.0    |
| Qwen-Max-0428                    | 72.3       | 86.3    | 65.0   | 45.0 | 27.3    | 59.2    |
| Qwen-1.5-110B                    | 70.3       | 82.3    | 64.0   | 47.3 | 28.0    | 58.4    |
| Deepseek-V2                      | 82.7       | 89.3    | 59.0   | 39.3 | 29.3    | 59.9    |
| Llama-3-70B-Instruct             | 70.3       | 86.0    | 53.0   | 38.7 | 34.7    | 56.5    |
| InternLM2-Math-Plus-Mixtral8x22B | 77.5       | 82.0    | 63.6   | 50.3 | 36.8    | 62.0    |
| InternLM2-Math-20B               | 58.7       | 70.0    | 43.7   | 24.7 | 12.7    | 42.0    |
| InternLM2-Math-Plus-20B          | 65.8       | 79.7    | 59.5   | 47.6 | 24.8    | 55.5    |
| Llama3-8B-Instruct               | 54.7       | 71.0    | 25.0   | 19.0 | 14.0    | 36.7    |
| InternLM2-Math-7B                | 53.7       | 67.0    | 41.3   | 18.3 | 8.0     | 37.7    |
| Deepseek-Math-7B-RL              | 68.0       | 83.3    | 44.3   | 33.0 | 23.0    | 50.3    |
| InternLM2-Math-Plus-7B           | 61.4       | 78.3    | 52.5   | 40.5 | 21.7    | 50.9    |
| MiniCPM-2B                       | 49.3       | 51.7    | 18.0   | 8.7  | 3.7     | 26.3    |
| InternLM2-Math-Plus-1.8B         | 43.0       | 43.3    | 25.4   | 18.9 | 4.7     | 27.1    |