ZwwWayne committed on
Commit
313afc0
1 Parent(s): 25f6a3e

Update README.md

  1. README.md +54 -41
README.md CHANGED
@@ -26,31 +26,37 @@ pipeline_tag: text-generation
26
 
27
  ## Introduction
28
 
29
- InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
30
- - It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
31
- - It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
32
- - It provides a versatile toolset for users to flexibly build their own workflows.
33
 
34
- ## InternLM-7B
35
 
36
  ### Performance Evaluation
37
 
38
- We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
39
-
40
- | Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
41
- | -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
42
- | C-Eval(Val) | 53.2 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
43
- | MMLU | 50.8 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
44
- | AGIEval | 42.5 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
45
- | CommonSenseQA | 75.2 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
46
- | BUSTM | 74.3 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
47
- | CLUEWSC | 78.6 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
48
- | MATH | 6.4 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
49
- | GSM8K | 34.5 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
50
- | HumanEval | 14.0 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
51
- | RACE(High) | 76.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
52
-
53
- - The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (data marked with * come from the original papers), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
54
  - The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
55
 
56
 
@@ -58,6 +64,7 @@ We conducted a comprehensive evaluation of InternLM using the open-source evalua
58
 
59
  ### Import from Transformers
60
  To load the InternLM 7B Chat model using Transformers, use the following code:
 
61
  ```python
62
  import torch
63
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -94,37 +101,43 @@ for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
94
  The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <[email protected]>.
95
 
96
  ## 简介
97
- InternLM ,即书生·浦语大模型,包含面向实用场景的70亿参数基础模型与对话模型 (InternLM-7B)。模型具有以下特点:
98
- - 使用上万亿高质量语料,建立模型超强知识体系;
99
- - 支持8k语境窗口长度,实现更长输入与更强推理体验;
100
- - 通用工具调用能力,支持用户灵活自助搭建流程;
101
 
102
- ## InternLM-7B
103
 
104
  ### 性能评测
105
 
106
  我们使用开源评测工具 [OpenCompass](https://github.com/internLM/OpenCompass/) 从学科综合能力、语言能力、知识能力、推理能力、理解能力五大能力维度对InternLM开展全面评测,部分评测结果如下表所示,欢迎访问[ OpenCompass 榜单 ](https://opencompass.org.cn/rank)获取更多的评测结果。
107
 
108
- | 数据集\模型 | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
109
- | -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
110
- | C-Eval(Val) | 53.2 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
111
- | MMLU | 50.8 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
112
- | AGIEval | 42.5 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
113
- | CommonSenseQA | 75.2 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
114
- | BUSTM | 74.3 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
115
- | CLUEWSC | 78.6 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
116
- | MATH | 6.4 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
117
- | GSM8K | 34.5 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
118
- | HumanEval | 14.0 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
119
- | RACE(High) | 76.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
120
-
121
- - 以上评测结果基于 [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) 获得(部分数据标注`*`代表数据来自原始论文),具体测试细节可参见 [OpenCompass](https://github.com/internLM/OpenCompass/) 中提供的配置文件。
122
  - 评测数据会因 [OpenCompass](https://github.com/internLM/OpenCompass/) 的版本迭代而存在数值差异,请以 [OpenCompass](https://github.com/internLM/OpenCompass/) 最新版的评测结果为主。
123
 
124
  **局限性:** 尽管在训练过程中我们非常注重模型的安全性,尽力促使模型输出符合伦理和法律要求的文本,但受限于模型大小以及概率生成范式,模型可能会产生各种不符合预期的输出,例如回复内容包含偏见、歧视等有害内容,请勿传播这些内容。由于传播不良信息导致的任何后果,本项目不承担责任。
125
 
126
  ### 通过 Transformers 加载
127
- 通过以下的代码加载 InternLM 7B Chat 模型
 
 
128
  ```python
129
  import torch
130
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
26
 
27
  ## Introduction
28
 
29
+ InternLM2 has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
30
 
31
+ - **200K Context window**: Nearly perfect at finding needles in the haystack with 200K-long context, with leading performance on long-context tasks like LongBench and L-Eval. Try it with [LMDeploy](https://github.com/InternLM/lmdeploy) for 200K-context inference.
32
+
33
+ - **Outstanding comprehensive performance**: Significantly better than the last generation in all dimensions, especially in reasoning, math, code, chat experience, instruction following, and creative writing, with leading performance among open-source models of similar size. In some evaluations, InternLM2-Chat-20B may match or even surpass ChatGPT (GPT-3.5).
34
+
35
+ - **Code interpreter & Data analysis**: With a code interpreter, InternLM2-Chat-20B achieves performance comparable to GPT-4 on GSM8K and MATH. InternLM2-Chat also provides data analysis capability.
36
+
37
+ - **Stronger tool use**: With stronger instruction following, tool selection, and reflection, InternLM2 can support more kinds of agents and multi-step tool calling for complex tasks. See [examples](https://github.com/InternLM/lagent).
38
+
39
+
40
+ ## InternLM2-Chat-7B-SFT
41
+
42
+ InternLM2-Chat-7B-SFT is the supervised fine-tuning (SFT) version based on InternLM2-Base-7B, and InternLM2-Chat-7B is further trained from InternLM2-Chat-7B-SFT with Online RLHF.
43
+ We release the SFT version so that the community can study the influence of RLHF in depth.
44
 
45
  ### Performance Evaluation
46
 
47
+ We conducted a comprehensive evaluation of InternLM2 using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
48
+
49
+ | Dataset\Models | InternLM2-7B | InternLM2-Chat-7B | InternLM2-20B | InternLM2-Chat-20B | ChatGPT | GPT-4 |
50
+ | --- | --- | --- | --- | --- | --- | --- |
51
+ | MMLU | 65.8 | 63.7 | 67.7 | 66.5 | 69.1 | 83.0 |
52
+ | AGIEval | 49.9 | 47.2 | 53.0 | 50.3 | 39.9 | 55.1 |
53
+ | BBH | 65.0 | 61.2 | 72.1 | 68.3 | 70.1 | 86.7 |
54
+ | GSM8K | 70.8 | 70.7 | 76.1 | 79.6 | 78.2 | 91.4 |
55
+ | MATH | 20.2 | 23.0 | 25.5 | 31.9 | 28.0 | 45.8 |
56
+ | HumanEval | 43.3 | 59.8 | 48.8 | 67.1 | 73.2 | 74.4 |
57
+ | MBPP(Sanitized) | 51.8 | 51.4 | 63.0 | 65.8 | 78.9 | 79.0 |
58
+
59
+ - The evaluation results were obtained from [OpenCompass](https://github.com/internLM/OpenCompass/) (data marked with * come from the original papers), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
60
  - The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
61
 
62
 
 
64
 
65
  ### Import from Transformers
66
  To load the InternLM 7B Chat model using Transformers, use the following code:
67
+
68
  ```python
69
  import torch
70
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
101
  The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <[email protected]>.
102
 
103
  ## 简介
104
 
105
+ InternLM2 ,即书生·浦语大模型第二代,开源了面向实用场景的70亿参数基础模型与对话模型 (InternLM2-Chat-7B)。模型具有以下特点:
106
+
107
+ - 有效支持20万字超长上下文:模型在20万字长输入中几乎完美地实现长文“大海捞针”,而且在 LongBench 和 L-Eval 等长文任务中的表现也达到开源模型中的领先水平。 可以通过 [LMDeploy](./inference/) 尝试20万字超长上下文推理。
108
+ - 综合性能全面提升:各能力维度相比上一代模型全面进步,在推理、数学、代码、对话体验、指令遵循和创意写作等方面的能力提升尤为显著,综合性能达到同量级开源模型的领先水平,在重点能力评测上 InternLM2-Chat-20B 能比肩甚至超越 ChatGPT (GPT-3.5)。
109
+ - 代码解释器与数据分析:在配合代码解释器(code-interpreter)的条件下,InternLM2-Chat-20B 在 GSM8K 和 MATH 上可以达到和 GPT-4 相仿的水平。基于在数理和工具方面强大的基础能力,InternLM2-Chat 提供了实用的数据分析能力。
110
+ - 工具调用能力整体升级:基于更强和更具有泛化性的指令理解、工具筛选与结果反思等能力,新版模型可以更可靠地支持复杂智能体的搭建,支持对工具进行有效的多轮调用,完成较复杂的任务。可以查看更多[样例](./agent/)。
111
+
112
+
113
+ ## InternLM2-Chat-7B-SFT
114
+
115
+ InternLM2-Chat-7B-SFT 基于 InternLM2-Base-7B 经过有监督微调(SFT)训练而来,InternLM2-Chat-7B 在 InternLM2-Chat-7B-SFT 的基础上进一步经历了 Online RLHF。
116
+ 我们开源 SFT 模型以便利社区对 RLHF 的研究。
117
 
118
  ### 性能评测
119
 
120
  我们使用开源评测工具 [OpenCompass](https://github.com/internLM/OpenCompass/) 从学科综合能力、语言能力、知识能力、推理能力、理解能力五大能力维度对InternLM开展全面评测,部分评测结果如下表所示,欢迎访问[ OpenCompass 榜单 ](https://opencompass.org.cn/rank)获取更多的评测结果。
121
 
122
+ | 评测集\模型 | InternLM2-7B | InternLM2-Chat-7B | InternLM2-20B | InternLM2-Chat-20B | ChatGPT | GPT-4 |
123
+ | --- | --- | --- | --- | --- | --- | --- |
124
+ | MMLU | 65.8 | 63.7 | 67.7 | 66.5 | 69.1 | 83.0 |
125
+ | AGIEval | 49.9 | 47.2 | 53.0 | 50.3 | 39.9 | 55.1 |
126
+ | BBH | 65.0 | 61.2 | 72.1 | 68.3 | 70.1 | 86.7 |
127
+ | GSM8K | 70.8 | 70.7 | 76.1 | 79.6 | 78.2 | 91.4 |
128
+ | MATH | 20.2 | 23.0 | 25.5 | 31.9 | 28.0 | 45.8 |
129
+ | HumanEval | 43.3 | 59.8 | 48.8 | 67.1 | 73.2 | 74.4 |
130
+ | MBPP(Sanitized) | 51.8 | 51.4 | 63.0 | 65.8 | 78.9 | 79.0 |
131
+
132
+ - 以上评测结果基于 [OpenCompass](https://github.com/internLM/OpenCompass/) 获得(部分数据标注`*`代表数据来自原始论文),具体测试细节可参见 [OpenCompass](https://github.com/internLM/OpenCompass/) 中提供的配置文件。
133
  - 评测数据会因 [OpenCompass](https://github.com/internLM/OpenCompass/) 的版本迭代而存在数值差异,请以 [OpenCompass](https://github.com/internLM/OpenCompass/) 最新版的评测结果为主。
134
 
135
  **局限性:** 尽管在训练过程中我们非常注重模型的安全性,尽力促使模型输出符合伦理和法律要求的文本,但受限于模型大小以及概率生成范式,模型可能会产生各种不符合预期的输出,例如回复内容包含偏见、歧视等有害内容,请勿传播这些内容。由于传播不良信息导致的任何后果,本项目不承担责任。
136
 
137
  ### 通过 Transformers 加载
138
+
139
+ 通过以下的代码加载 InternLM2 7B Chat SFT 模型
140
+
141
  ```python
142
  import torch
143
  from transformers import AutoTokenizer, AutoModelForCausalLM