itpossible committed on
Commit ccde124 · verified · 1 Parent(s): b4addf2

Update README.md

Files changed (1)
  1. README.md +178 -14

README.md CHANGED
@@ -1,19 +1,183 @@
- ## 🎉 News
- - [2024-10-11] [New article alert | PreparedLLM: a "pre-pretraining" framework for efficiently training domain-specific large language models](https://mp.weixin.qq.com/s/ugJQ9tbp6Y87xA3TOWteqw).
- - [2024-08-31] The article [PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models](https://www.tandfonline.com/doi/full/10.1080/20964471.2024.2396159) has been accepted by the *Big Earth Data* journal.
- - [2024-08-31] Released the [Chinese-Mistral-7B-Instruct-v0.2](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.2) chat model, with greatly improved language understanding and multi-turn dialogue capability.
- - [2024-06-30] Released the [JiuZhou-Instruct-v0.2](https://huggingface.co/itpossible/JiuZhou-Instruct-v0.2) chat model, with greatly improved language understanding and multi-turn dialogue capability.
- - [2024-04-04] Released [Chinese-Mistral-7B-Instruct-v0.1](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1).
- - [2024-03-31] Released the [JiuZhou-base](https://huggingface.co/itpossible/JiuZhou-base) and [Chinese-Mistral-7B-v0.1](https://huggingface.co/itpossible/Chinese-Mistral-7B) base models.
-
- JiuZhou is a powerful bilingual LLM with 7 billion parameters, developed by the Tsinghua research team.<br>
- Use case: we use a widely circulated and interesting question.<br>
- Question: 9.11 and 9.9 - which is bigger?<br>
- · JiuZhou answers correctly.
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64ccae20bb5d195b9947f99f/W1mF-3rz-HHI4e6HdeukJ.png)
-
- · ChatGPT, Gemini, Moonshot AI, Qianwen, Mixtral 8x7B, and Llama 3 all answer incorrectly.
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64ccae20bb5d195b9947f99f/wNWYl5Ch050n7GejE0voz.png)
+ <div align="center">
+ <h1>
+ JiuZhou: Open Foundation Language Models for Geoscience
+ </h1>
+ </div>
+
+ \[ English | [中文](README_zh.md) \]
+
+ ## 🎉 News
+ - [2024-12-31] **Article [JiuZhou: Open Foundation Language Models and Effective Pre-training Framework for Geoscience](https://www.tandfonline.com/doi/full/10.1080/17538947.2025.2449708) has been accepted for publication in the *International Journal of Digital Earth***. [Code and Data](https://github.com/THU-ESIS/JiuZhou).
+ - [2024-10-11] WeChat article: [PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models](https://mp.weixin.qq.com/s/ugJQ9tbp6Y87xA3TOWteqw).
+ - [2024-09-06] Released the [ClimateChat](https://huggingface.co/itpossible/ClimateChat) instruct model.
+ - [2024-08-31] **Article [PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models](https://www.tandfonline.com/doi/full/10.1080/20964471.2024.2396159) has been accepted for publication in the *Big Earth Data* journal**.
+ - [2024-08-31] Released the [Chinese-Mistral-7B-Instruct-v0.2](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.2) instruct model, with significantly improved language understanding and multi-turn dialogue capabilities.
+ - [2024-06-30] Released the [JiuZhou-Instruct-v0.2](https://huggingface.co/itpossible/JiuZhou-Instruct-v0.2) instruct model, with significantly improved language understanding and multi-turn dialogue capabilities.
+ - [2024-05-15] WeChat article: [Chinese Vocabulary Expansion Incremental Pretraining for Large Language Models: Chinese-Mistral Released](https://mp.weixin.qq.com/s/PMQmRCZMWosWMfgKRBjLlQ).
+ - [2024-04-04] Released the [Chinese-Mistral-7B-Instruct-v0.1](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1) instruct model.
+ - [2024-03-31] Released the [Chinese-Mistral-7B-v0.1](https://huggingface.co/itpossible/Chinese-Mistral-7B) base model.
+ - [2024-03-15] Released the base version [JiuZhou-base](https://huggingface.co/itpossible/JiuZhou-base), the instruct version [JiuZhou-instruct-v0.1](https://huggingface.co/itpossible/JiuZhou-Instruct-v0.1), and [intermediate checkpoints](https://huggingface.co/itpossible).
+
+
+ ## Table of Contents
+
+ - [Introduction](#introduction)
+ - [Download](#download)
+ - [Inference](#inference)
+ - [Model Performance](#model-performance)
+ - [Model Training Process](#model-training-process)
+ - [Model Training Code](#model-training-code)
+ - [Citations](#citations)
+ - [Acknowledgments](#acknowledgments)
+
+ ## Introduction
+ The field of geoscience has amassed a vast amount of data, and extracting and integrating the diverse knowledge in this data is essential for addressing global change challenges, promoting sustainable development, and accelerating scientific discovery. Foundation language models first learn and integrate knowledge autonomously through self-supervised pre-training on extensive text data, and then acquire the capability to solve geoscience problems through instruction tuning. However, when a foundation language model lacks sufficient geoscience expertise, instruction tuning with relevant data can lead to outputs that are inconsistent with established facts. A robust geoscience foundation language model is therefore urgently needed to improve accuracy and practicality.<br>
+
+ This study uses [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base model and continues pretraining on a large geoscience corpus. It also incorporates the [domain-specific large language model *pre*-pretraining framework (PreparedLLM)](https://www.tandfonline.com/doi/full/10.1080/20964471.2024.2396159) and the "two-stage pre-adaptation pre-training" algorithm to build the geoscience large language model, JiuZhou.
+
+
+ ## Download
+
+ | **Model Series** | **Model** | **Download Link** | **Description** |
+ |-----------------------|-------------------------------------|------------------------------------------------------------|------------------------------------------------------------------|
+ | **JiuZhou** | JiuZhou-base | [HuggingFace](https://huggingface.co/itpossible/JiuZhou-base) | Base model (rich in geoscience knowledge) |
+ | **JiuZhou** | JiuZhou-Instruct-v0.1 | [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1) | Instruct model (instruction alignment caused some loss of geoscience knowledge, but adds instruction-following ability) <br> LoRA fine-tuned on Chinese and English Alpaca_GPT4 and GeoSignal |
+ | **JiuZhou** | JiuZhou-Instruct-v0.2 | [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.2)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/Chinese-Mistral-7B-Instruct-v0.2) | Instruct model (instruction alignment caused some loss of geoscience knowledge, but adds instruction-following ability) <br> Fine-tuned with high-quality general instruction data |
+ | **ClimateChat** | ClimateChat | [HuggingFace](https://huggingface.co/itpossible/ClimateChat)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/ClimateChat) | Instruct model <br> Fine-tuned on JiuZhou-base for instruction following |
+ | **Chinese-Mistral** | Chinese-Mistral-7B | [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-v0.1)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/Chinese-Mistral-7B-v0.1)<br>[ModelScope](https://www.modelscope.cn/models/itpossible/Chinese-Mistral-7B-v0.1) | Base model |
+ | **Chinese-Mistral** | Chinese-Mistral-7B-Instruct-v0.1 | [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/Chinese-Mistral-7B-Instruct-v0.1)<br>[ModelScope](https://www.modelscope.cn/models/itpossible/Chinese-Mistral-7B-Instruct-v0.1) | Instruct model <br> LoRA fine-tuned with Chinese and English Alpaca_GPT4 |
+ | **Chinese-Mistral** | Chinese-Mistral-7B-Instruct-v0.2 | [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.2)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/Chinese-Mistral-7B-Instruct-v0.2) | Instruct model <br> LoRA fine-tuned with a million high-quality instructions |
+ | **PreparedLLM** | Prepared-Llama | [HuggingFace](https://huggingface.co/itpossible/Prepared-Llama)<br>[Wisemodel](https://wisemodel.cn/models/itpossible/PREPARED-Llama) | Base model <br> Continual pretraining on a small amount of geoscience data <br> JiuZhou is recommended instead |
+
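+ As an optional convenience, any checkpoint in the table can be fetched ahead of time with `huggingface_hub` (a minimal sketch; the target directory is an arbitrary example):
+ ```python
+ # Sketch: pre-download a checkpoint listed in the table above.
+ from huggingface_hub import snapshot_download
+
+ # repo_id is taken from the table; local_dir is an arbitrary example path.
+ snapshot_download(repo_id="itpossible/JiuZhou-Instruct-v0.2", local_dir="./JiuZhou-Instruct-v0.2")
+ ```
+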
+
+ ## Inference
+ Below is an example of inference code using JiuZhou-Instruct-v0.2.
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
+
+ # Load the tokenizer and model from the Hugging Face Hub
+ model_path = "itpossible/JiuZhou-Instruct-v0.2"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)
+
+ # Wrap the question in the chat format expected by the instruct model
+ text = "What is geoscience?"
+ messages = [{"role": "user", "content": text}]
+
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
+ outputs_id = model.generate(inputs, max_new_tokens=600, do_sample=True)
+ outputs = tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0]
+ print(outputs)
+ ```
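+
+ The instruct model also supports multi-turn dialogue. A minimal sketch of a follow-up turn, reusing the `tokenizer`, `model`, and `messages` defined above (the follow-up question is only an example):
+ ```python
+ # Keep only the newly generated tokens as the assistant reply for this turn.
+ reply = tokenizer.batch_decode(outputs_id[:, inputs.shape[1]:], skip_special_tokens=True)[0]
+ messages.append({"role": "assistant", "content": reply})
+ messages.append({"role": "user", "content": "Give three examples of how geoscience research is applied."})
+
+ # Re-apply the chat template to the full history and generate the next answer.
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
+ outputs_id = model.generate(inputs, max_new_tokens=600, do_sample=True)
+ print(tokenizer.batch_decode(outputs_id[:, inputs.shape[1]:], skip_special_tokens=True)[0])
+ ```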
+
+ ## Model Performance
+
+ ### Geoscience Ability
+ We evaluate the performance of JiuZhou using the GeoBench benchmark.<br>
+ JiuZhou outperforms GPT-3.5 in objective tasks:
+ <p align="center">
+ <br>
+ <img src="image/objective_score.png" width="800"/>
+ <br>
+ </p>
+
+ JiuZhou also scores higher than ClimateChat across six criteria in subjective tasks:
+ <p align="center">
+ <br>
+ <img src="image/subjective_score.png" width="800"/>
+ <br>
+ </p>
+
+ ### General Ability
+
+ We evaluate the performance of Chinese-Mistral-7B using three benchmark datasets: C-Eval, CMMLU, and MMLU.<br>
+ Compared to other variants of Llama and Mistral models, JiuZhou shows outstanding performance:
+ <p align="center">
+ <br>
+ <img src="image/general_score.png" width="800"/>
+ <br>
+ </p>
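+
+ C-Eval, CMMLU, and MMLU are multiple-choice benchmarks. Purely for orientation, here is a generic sketch of scoring one multiple-choice item with the model loaded in the Inference section; it is not the harness used for the reported results, and the example item is invented:
+ ```python
+ import re
+
+ def answer_multiple_choice(question, options, tokenizer, model, device):
+     """Ask the model for a single option letter and extract it from the reply."""
+     letters = "ABCD"[:len(options)]
+     prompt = question + "\n" + "\n".join(
+         f"{letter}. {option}" for letter, option in zip(letters, options)
+     ) + "\nAnswer with a single letter."
+     messages = [{"role": "user", "content": prompt}]
+     inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
+     outputs_id = model.generate(inputs, max_new_tokens=8, do_sample=False)
+     reply = tokenizer.batch_decode(outputs_id[:, inputs.shape[1]:], skip_special_tokens=True)[0]
+     match = re.search(r"[A-D]", reply)
+     return match.group(0) if match else None
+
+ # Invented example item; benchmark accuracy is the fraction of items whose
+ # extracted letter matches the reference answer.
+ print(answer_multiple_choice(
+     "Which rock type forms from cooled magma?",
+     ["Sedimentary", "Igneous", "Metamorphic", "Fossiliferous"],
+     tokenizer, model, device,
+ ))
+ ```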
+
+ ## Model Training Process
+
+ ### Training Corpus
+ The corpus consists of 50 million general documents and 3.4 million geoscience-related documents.
+ <p align="center">
+ <br>
+ <img src="image/JiuZhou-Corpus.png" width="800"/>
+ <br>
+ </p>
+
+ ### Training Framework
+ We use the JiuZhou-Framework proposed in this study.
+ <p align="center">
+ <br>
+ <img src="image/JiuZhou-Framework.png" width="800"/>
+ <br>
+ </p>
+
+ ### Two-stage Pre-adaptation Pre-training (TSPT)
+ TSPT improves the efficiency of using limited geoscience data and overcomes some of the technical bottlenecks in continual pretraining for LLMs.<br>
+ The difference between TSPT and single-stage training algorithms:
+ <p align="center">
+ <br>
+ <img src="image/TSPT.png" width="800"/>
+ <br>
+ </p>
+ Comparison of the performance of TSPT and single-stage pre-training algorithms:
+ <p align="center">
+ <br>
+ <img src="image/TSPT_score.png" width="800"/>
+ <br>
+ </p>
+
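+ The exact TSPT recipe (data selection, mixture ratios, and schedules) is described in the JiuZhou and PreparedLLM papers. Purely to illustrate the general idea of sequential stages trained on different data mixtures, here is a toy sketch using Hugging Face `datasets`; the file paths and proportions are invented placeholders, not the values used for JiuZhou:
+ ```python
+ # Toy illustration of stage-wise continual-pretraining data mixtures.
+ # Paths and mixture probabilities are placeholders, not JiuZhou's actual recipe.
+ from datasets import interleave_datasets, load_dataset
+
+ general = load_dataset("text", data_files="data/general.txt", split="train")
+ geoscience = load_dataset("text", data_files="data/geoscience.txt", split="train")
+
+ # Stage 1: a mixture that eases the base model toward the domain.
+ stage1 = interleave_datasets([general, geoscience], probabilities=[0.7, 0.3], seed=42)
+ # Stage 2: a geoscience-dominant mixture for domain adaptation.
+ stage2 = interleave_datasets([general, geoscience], probabilities=[0.1, 0.9], seed=42)
+
+ for name, mixture in [("stage-1", stage1), ("stage-2", stage2)]:
+     # In practice, each stage would drive one continual-pretraining run.
+     print(name, mixture)
+ ```
+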
+ ## Model Training Code
+ We use [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) to fine-tune JiuZhou.
+
+ ### Project Deployment
+ ```bash
+ git clone https://github.com/THU-ESIS/JiuZhou.git
+ cd JiuZhou
+ pip install -e ".[torch,metrics]"
+ ```
+
+ ### Model Training
+ Pre-training:
+ ```bash
+ llamafactory-cli train examples/train_lora/JiuZhou_pretrain_sft.yaml
+ ```
+ Instruction-tuning:
+ ```bash
+ llamafactory-cli train examples/train_lora/JiuZhou_lora_sft.yaml
+ ```
+ Chat with the fine-tuned JiuZhou:
+ ```bash
+ llamafactory-cli chat examples/inference/JiuZhou_lora_sft.yaml
+ ```
+ Merge the instruction-tuned LoRA weights with the original JiuZhou weights:
+ ```bash
+ llamafactory-cli export examples/merge_lora/JiuZhou_lora_sft.yaml
+ ```
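+
+ If you prefer to do the merge in Python rather than through `llamafactory-cli export`, the standard `peft` merge routine looks roughly as follows (the adapter and output paths are placeholders):
+ ```python
+ # Sketch: merge a LoRA adapter into the base weights with peft.
+ # Paths are placeholders; the export command above is the documented route.
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_path = "itpossible/JiuZhou-base"
+ adapter_path = "output/JiuZhou_lora_sft"   # placeholder: your LoRA output directory
+ merged_path = "output/JiuZhou_merged"      # placeholder: where to save the merged model
+
+ base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16)
+ merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
+ merged.save_pretrained(merged_path)
+ AutoTokenizer.from_pretrained(base_path).save_pretrained(merged_path)
+ ```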
+
+ ## Citations
+ ```bibtex
+ @article{chen2024preparedllm,
+   author  = {Chen, Zhou and Lin, Ming and Wang, Zimeng and Zang, Mingrun and Bai, Yuqi},
+   title   = {PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models},
+   year    = {2024},
+   journal = {Big Earth Data},
+   pages   = {1--24},
+   doi     = {10.1080/20964471.2024.2396159},
+   url     = {https://doi.org/10.1080/20964471.2024.2396159}
+ }
+ ```
+
+ ## Acknowledgments
+ - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
+ - [OpenCompass](https://github.com/open-compass/opencompass)
+ - [K2](https://github.com/davendw49/k2)
+ - [GeoGalactica](https://github.com/geobrain-ai/geogalactica)
+ - [BB-GeoGPT](https://github.com/AGI-GIS/BB-GeoGPT)