stgzr committed on
Commit
6b72ec7
1 Parent(s): 0f8ae8c

update README.md

  1. README.md +260 -5
README.md CHANGED
@@ -1,5 +1,260 @@
1
- ---
2
- license: other
3
- license_name: inf
4
- license_link: LICENSE
5
- ---
1
+ ---
2
+ license: other
3
+ license_name: inf
4
+ license_link: LICENSE
5
+ ---
6
+
7
+ <div align="center">
8
+ <img src="https://github.com/infly-ai/INF-LLM/blob/main/images/logo.png?raw=true" width="35%" alt="INF-34B" />
9
+ </div>
10
+ <hr>
11
+ <div align="center">
12
+
13
+ <a href="https://chat.infly.cn/" target="_blank">
14
+ <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-INF%20LLM-536af5?color=536af5&logoColor=white" />
15
+ </a>
16
+ <a href="https://huggingface.co/infly" target="_blank">
17
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-INF%20AI-ffc107?color=ffc107&logoColor=white" />
18
+ </a>
19
+ </div>
20
+
21
+
22
+ ## 1. Introduction
23
+
24
+ INF-34B has 34 billion parameters and a 32K context window, and is trained on about 3.5T well-processed tokens from an English and Chinese bilingual corpus. Compared with open-source models of comparable size, INF-34B not only delivers competitive performance in the OpenCompass evaluation, but also shows strong potential in the finance and healthcare domains. In addition, the quantized INF-34B runs on graphics cards with 24GB of VRAM with negligible accuracy loss, which facilitates commercial applications, especially in low-resource scenarios.
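
As a rough sanity check on the 24GB claim, the sketch below estimates the memory taken by the weights alone at different bit widths. It is only back-of-the-envelope arithmetic: it ignores quantization scales and zero points, the KV cache, and activations, and the helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 34e9  # parameter count stated in the introduction

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    # Only the 4-bit estimate leaves headroom on a 24GB card.
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions only the 4-bit weights fit on a 24GB card with room left for the cache and activations.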
25
+
26
+ <div align="center">
27
+ <img src="https://github.com/infly-ai/INF-LLM/blob/main/images/teaser.png?raw=true" alt="result" width="100%">
28
+ </div>
29
+
30
+ - **Detailed Training Recipe for a GPT Model:** We provide comprehensive details about our model pretraining and alignment, including the high-quality data pipeline, instruction data preparation, and quantization results.
31
+
32
+ - **Superior Performance on Benchmarks:** We demonstrate superior performance of the INF-34B models by comparing against two competitors with comparable model size, Qwen1.5-32B and Yi1.5-34B, on the public OpenCompass benchmarks.
33
+
34
+
35
+ ## 2. Models
36
+
37
+ We release base and chat models with 34B parameters based on the LLaMA framework, using LayerNorm with zero-centered gamma instead of RMSNorm for training stability. Please **note** that you can use our models for commercial applications under the terms outlined in the [License section](#6-license).
38
+
39
+ ### Huggingface
40
+
41
+ | Model | Sequence Length | Download |
42
+ |:---------------------:|:---------------:|:-----------------------------------------------------------------------:|
43
+ | INF-34B-Base | 32K | 🤗 [HuggingFace](https://huggingface.co/infly/INF-34B-Base) |
44
+ | INF-34B-Chat | 32K | 🤗 [HuggingFace](https://huggingface.co/infly/INF-34B-Chat) |
45
+ | INF-34B-Chat-GPTQ-4bit | 32K | 🤗 [HuggingFace](https://huggingface.co/infly/INF-34B-Chat-GPTQ-4bit) |
46
+ | INF-34B-Chat-GPTQ-8bit | 32K | 🤗 [HuggingFace](https://huggingface.co/infly/INF-34B-Chat-GPTQ-8bit) |
47
+ | INF-34B-Chat-AWQ | 32K | 🤗 [HuggingFace](https://huggingface.co/infly/INF-34B-Chat-AWQ) |
48
+
49
+
50
+ ## 3. Benchmarks
51
+
52
+ **Note:** To reproduce the evaluation results, please refer to the [details of evaluation](https://github.com/infly-ai/INF-LLM/blob/main/evaluation/Evaluation.md), which cover the prompts, post-processing scripts, and versions of the inference frameworks.
53
+
54
+ ### Base Model
55
+
56
+ We evaluate our model on several academic benchmarks and compare it with other open-access models of similar size. INF-34B performs better in the fields we chose to optimize while preserving the general capabilities of an LLM, such as commonsense reasoning, world knowledge, math, and coding.
57
+
58
+ | model | Qwen1.5-32B | Yi1.5-34B | INF-34B |
59
+ |:---------------:|:-------------:|:------------:|:------:|
60
+ | MMLU(5-shot) | 73.60 | 77.86 | 76.11 |
61
+ | CMMLU(5-shot) | 81.87 | 81.85 | 80.08 |
62
+ | GSM8K(4-shot) | 72.86 | 80.06 | 83.02 |
63
+ | MATH(4-shot) | 36.80 | 33.88 | 38.34 |
64
+ | HumanEval(0-shot) | 44.51 | 47.56 | 65.24 |
65
+ | MBPP(3-shot) | 51.00 | 65.60 | 64.00 |
66
+ | BBH(3-shot) | 70.60 | 74.83 | 71.20 |
67
+ | HellaSwag(0-shot) | 82.03 | 81.57 | 83.32 |
68
+
69
+
70
+ **Note:** To facilitate reproduction, the results on common benchmarks are generated with [OpenCompass](https://github.com/open-compass/opencompass), except for HumanEval and MBPP, where we experienced code timeout and post-processing issues. In addition, USMLE and CFA are evaluated using internal evaluation scripts.
71
+
72
+ ### Chat Model
73
+
74
+ We present the performance of our chat model and other LLMs on various standard benchmarks, as well as on two domain-specific benchmarks.
75
+
76
+ | model | Qwen1.5-32B-Chat | Yi1.5-34B-Chat | INF-34B-Chat |
77
+ |:---------------:|:-------------:|:------------:|:------:|
78
+ | MT-bench | 8.3 | 8.5 | 8.3 |
79
+ | AlignBench | 7.1 | 7.2 | 7.1 |
80
+ | IFEval | 49.54 | 58.04 | 59.70 |
81
+ | Arena-Hard | 24.2 | 42.6 | 43.1 |
82
+ | GSM8K | 81.42 | 79.45 | 84.04 |
83
+ | MATH | 42.28 | 54.06 | 51.48 |
84
+ | USMLE | 58.70 | 55.84 | 79.70 |
85
+ | CFA 2.0 | 35.5 | 42.5 | 62.75 |
86
+
87
+
88
+ ### Long Context
89
+
90
+ We employed a long-context SFT dataset with samples of various lengths: 37.7% are shorter than 8k tokens, 40.5% fall within 8k to 16k tokens, and 21.8% range from 16k to 32k tokens. Our model outperforms Qwen1.5-32B-Chat in five of the six LongBench task categories (evaluated via [OpenCompass](https://github.com/open-compass/opencompass)).
91
+
92
+
93
+ | model | Single-Doc<br>QA | Multi-Doc<br>QA | Summari-<br>zation | Few-shot<br>Learning | Synthetic | Code |
94
+ |:---------------:|:-------------:|:------------:|:------:|:------:|:------:|:------:|
95
+ | Qwen1.5-32B-Chat | 45.6 | 40.4 | 23.1 | 52.6 | 67.3 | 43.8 |
96
+ | INF-34B-Chat | 47.4 | 43.2 | 24.1 | 66.0 | 66.8 | 57.2 |
97
+
98
+ **Note:** Each result reported in the table is the average over the sub-tasks in that category.
99
+
100
+ INF-34B-32k also performs well across context window lengths up to 32k on the Single-Needle Retrieval Task (S-RT), as visualized below.
101
+ <div align="center">
102
+ <img src="https://github.com/infly-ai/INF-LLM/blob/main/images/srt.png?raw=true" alt="SRT Results" width="100%">
103
+ </div>
104
+
105
+ ## 4. Training Details
106
+
107
+ ### Data Pipeline
108
+
109
+ We propose different data pipelines for general, domain and code data to ensure the richness, variety and quality of training samples. The general data pipeline applies general-purpose processing methods. For the domain data of interest, e.g., math, wiki, code, we propose a domain-specific data pipeline to extract the domain data from Common Crawl (CC). We also devise a code-specific pipeline to handle massive code data, since code data has proven effective in improving the model’s reasoning and comprehension ability.
110
+
111
+ - **General Data Pipeline**: Our text cleaning pipeline mainly includes two stages: filtering and deduplication. The filtering involves language identification, URL filtering, and heuristic filtering rules. The deduplication includes both fuzzy deduplication and exact deduplication techniques.
112
+
113
+ - **Domain Data Pipeline**: We propose an iterative high-quality data retrieval method that recalls relevant data from the Common Crawl (CC) dataset for various target domains. It comprises three main components: FastText training, recall, and human annotation.
114
+
115
+ - **Code Data Pipeline**: The code-specific data processing pipeline includes modules of preprocessing, heuristic filtering, deduplication, transformation and data mixture.
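
The filtering and exact-deduplication stages of the general pipeline can be pictured with a small sketch like the one below. It is an illustration only, not the pipeline used for INF-34B: the thresholds, the function names (`passes_heuristics`, `normalize`, `clean`), and the choice of SHA-256 over normalized text are all assumptions, and language identification, URL filtering, and fuzzy deduplication are left out.

```python
import hashlib
import re
from typing import Iterable, Iterator


def passes_heuristics(text: str, min_chars: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Toy heuristic filter: drop very short or symbol-heavy documents."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count characters that are neither letters/digits (any script) nor whitespace.
    symbols = sum(1 for ch in stripped if not (ch.isalnum() or ch.isspace()))
    return symbols / len(stripped) <= max_symbol_ratio


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents that pass the filter and are not exact duplicates."""
    seen = set()
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc
```

Fuzzy deduplication would typically replace the exact hash with a similarity-preserving signature, which this sketch does not attempt.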
116
+
117
+ ### Training Settings
118
+
119
+ The architecture of INF-34B follows the LLaMA framework. Specifically, we opt for Rotary Embedding for positional encoding, SwiGLU as the activation function, Grouped Query Attention (GQA), and LayerNorm with zero-centered gamma instead of RMSNorm for training stability.
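
To make "LayerNorm with zero-centered gamma" concrete, here is a minimal pure-Python sketch. It assumes the common parameterization in which the learnable scale is stored as `gamma` and applied as `1 + gamma`, so that `gamma` starts at zero and stays centered around zero; the README does not spell out the exact formulation, so treat this as one plausible reading rather than the model's actual code.

```python
import math
from typing import List, Sequence


def layer_norm_zero_centered(
    x: Sequence[float],
    gamma: Sequence[float],
    beta: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """LayerNorm over one vector, with the scale applied as (1 + gamma).

    Unlike RMSNorm, the mean is subtracted before normalizing.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * (1.0 + g) + b for v, g, b in zip(x, gamma, beta)]


# With gamma = 0 and beta = 0 the layer is a plain standardization.
hidden = [1.0, 2.0, 3.0, 4.0]
print(layer_norm_zero_centered(hidden, [0.0] * 4, [0.0] * 4))
```

With this parameterization, weight decay pulls the effective scale toward 1 rather than toward 0.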
120
+
121
+ Motivated by the idea of first training on a relatively large but less polished corpus to equip the model with language understanding and world knowledge, and then improving the model’s domain knowledge and reasoning ability, we split our training process into 3 stages:
122
+ - Stage 1: The dataset mainly includes web text, papers, Wikipedia and source code. In this early stage, we aim for larger data volume and higher diversity.
123
+ - Stage 2: In the second stage, we seek to gradually challenge the model with longer and more complex texts. We up-weight long texts within the same data distribution as stage 1. We tune the RoPE base and extend our context window to 32k for more sophisticated comprehension of human knowledge.
124
+ - Stage 3: The final stage is composed of domain data recalled from web text and synthetic data.
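
Stage 2 mentions tuning the rotary embedding base when extending the context window. The sketch below shows why the base matters, using the standard rotary formulation in which frequency pair `i` rotates by `pos * base ** (-2 * i / head_dim)`. The head dimension and both base values are hypothetical placeholders, not the values used for INF-34B.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Pair i completes one full rotation every
    2 * pi * base ** (2 * i / head_dim) positions.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128  # hypothetical head dimension
for base in (10_000.0, 1_000_000.0):  # hypothetical base values
    longest = rope_wavelengths(HEAD_DIM, base)[-1]
    print(f"base={base:>11,.0f}  longest wavelength ~ {longest:,.0f} positions")
```

A larger base stretches the slowest-rotating pairs over more positions, which is the usual motivation for raising it when training on longer sequences.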
125
+
126
+ <div align="center">
127
+ <img src="https://github.com/infly-ai/INF-LLM/blob/main/images/setting.png?raw=true" alt="result" width="70%">
128
+ </div>
129
+
130
+ ## 5. Inference
131
+
132
+ This repo contains GPTQ model files, which were made with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).
133
+
134
+ <details>
135
+ <summary>Explanation of GPTQ parameters</summary>
136
+
137
+ - Bits: The bit size of the quantised model.
138
+ - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
139
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
140
+ - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
141
+ </details>
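
The effect of the Bits and GS parameters can be illustrated with a toy group-wise quantizer. Note that this is plain round-to-nearest quantization, not the GPTQ algorithm itself (which additionally uses calibration data to compensate rounding error); the function names and the synthetic weights are made up for the illustration.

```python
from typing import List, Sequence


def quantize_groupwise(weights: Sequence[float], bits: int, group_size: int) -> List[float]:
    """Round-to-nearest quantization with one (scale, offset) pair per group.

    Returns the dequantized weights so the error can be measured directly.
    """
    levels = 2**bits - 1
    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start : start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # avoid dividing by zero for constant groups
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic weights in [-0.5, 0.5] with a single large outlier.
weights = [((i * 37) % 101) / 100 - 0.5 for i in range(128)]
weights[5] = 8.0

for bits, group_size in [(4, 128), (4, 32), (8, 128)]:
    err = mean_abs_error(weights, quantize_groupwise(weights, bits, group_size))
    print(f"bits={bits} group_size={group_size:>3} mean abs error={err:.4f}")
```

Smaller groups confine the outlier's large range to fewer weights at the cost of storing more scale and offset pairs, which matches the VRAM versus accuracy trade-off described for GS.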
142
+
143
+ We provide inference examples in Python below.
144
+
145
+ ### Installation
146
+
147
+ Requirements: the Hugging Face Transformers examples use the following environment:
148
+ - Pytorch 2.3.0+cu121
149
+ - Flash Attention 2.5.0
150
+ - Transformers 4.42.4
151
+
152
+ ```shell
153
+ pip3 install transformers optimum
154
+ pip3 uninstall -y auto-gptq
155
+ git clone https://github.com/infly-ai/AutoGPTQ
156
+ cd AutoGPTQ
157
+ git checkout inflm
158
+ pip3 install .
159
+ # If you are compiling on an A800/H800 GPU, you can set the environment variable with 'export TORCH_CUDA_ARCH_LIST="8.0;9.0"'
160
+ ```
161
+
162
+ ### Inference with Huggingface's Transformers
163
+
164
+ We provide the inference examples with [Huggingface's Transformers](https://github.com/huggingface/transformers).
165
+
166
+ ```python
167
+ from transformers import AutoModelForCausalLM, AutoTokenizer
168
+ import time
169
+
170
+ model_path="path/to/model/"
171
+ device = "cuda" # the device to load the model onto
172
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
173
+ model = AutoModelForCausalLM.from_pretrained(
174
+     model_path,
175
+     device_map="auto",
176
+     trust_remote_code=True
177
+ )
178
+
179
+ prompt = "Write a resume for a fresh high school graduate who is seeking their first job. Make sure to include at least 12 placeholder represented by square brackets, such as [address], [name]."
180
+ messages = [
181
+     {"role": "user", "content": prompt}
182
+ ]
183
+ text = tokenizer.apply_chat_template(
184
+     messages,
185
+     tokenize=False,
186
+     add_generation_prompt=True
187
+ )
188
+
189
+ model_inputs = tokenizer([text], return_tensors="pt").to(device)
190
+
191
+ context_time = 0  # accumulated wall-clock time spent in generate(), in seconds
192
+ start = time.time()
193
+ generated_ids = model.generate(
194
+     model_inputs.input_ids,
195
+     max_new_tokens=200
196
+ )
197
+ context_time += time.time() - start
198
+
199
+ generated_ids = [
200
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)  # drop the prompt tokens, keep only the newly generated ones
201
+ ]
202
+
203
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
204
+ print("=> response: ", response)
205
+ ```
206
+
207
+ ### Inference with AutoGPTQ
208
+
209
+ The files provided above are tested to work with Transformers. AutoGPTQ can also be used directly.
210
+ **Note**: If you encounter `RuntimeError: probability tensor contains either inf, nan or element < 0` during inference with Transformers, or the generated tokens contain `<unk>`, we **strongly recommend** running the code with the following method.
211
+
212
+ ```python
213
+ import torch
214
+ from auto_gptq import AutoGPTQForCausalLM
215
+ from transformers import AutoTokenizer
216
+
217
+ model_path = "/path/to/model/"
218
+
219
+ device = "cuda:0"
220
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)
221
+ tokenizer.pad_token = tokenizer.eos_token
222
+ tokenizer.padding_side = "left"
223
+
224
+ model = AutoGPTQForCausalLM.from_quantized(
225
+     model_path,
226
+     inject_fused_attention=False,
227
+     inject_fused_mlp=False,
228
+     device=device,
229
+     trust_remote_code=True,
230
+     use_marlin=True, # the Marlin kernel solves the <unk> issue for gptq-4bit; for gptq-8bit set 'use_marlin=False'
231
+     use_triton=False, # for gptq-8bit set 'use_triton=True'
232
+ )
233
+ model.eval()
234
+
235
+ prompts = [
236
+ "I would like to",
237
+ "北京的冬天很冷,广州的夏天很热。",
238
+ "I have a dream that",
239
+ "To be or not to be, that is the question.",
240
+ ]
241
+
242
+ token_dict = tokenizer(prompts, return_tensors="pt", padding="longest").to(device)
243
+ with torch.inference_mode():
244
+     output_ids = model.generate(**token_dict, max_new_tokens=200)
245
+     output_ids_cut = output_ids[:, token_dict["input_ids"].shape[1] :]
246
+
247
+ for nb, output_id in enumerate(output_ids_cut):
248
+     print(f"Prompt {nb}: {prompts[nb]}")
249
+     print(f"Generated: {tokenizer.decode(output_id, skip_special_tokens=False)}")
250
+     print('*'*40)
251
+ ```
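
The AutoGPTQ example sets `tokenizer.padding_side = "left"` and then slices off the first `token_dict["input_ids"].shape[1]` columns of the output. The toy sketch below, using plain Python lists and made-up token ids, shows why the two go together: with left padding every prompt ends at the same column, so a single slice removes the prompt from every row. The helper names `left_pad` and `strip_prompt` are hypothetical.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the left to the longest length; also build the attention mask."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


def strip_prompt(output_ids: List[List[int]], prompt_width: int) -> List[List[int]]:
    """Keep only the tokens generated after the (padded) prompt."""
    return [row[prompt_width:] for row in output_ids]


# Made-up token ids: two prompts of different lengths, pad id 0.
input_ids, attention_mask = left_pad([[11, 12, 13], [21]], pad_id=0)
print(input_ids)       # both prompts now end at the last column
print(attention_mask)  # zeros mark the padding positions

# Pretend the model appended two new tokens to each row.
fake_output = [row + [91, 92] for row in input_ids]
print(strip_prompt(fake_output, prompt_width=len(input_ids[0])))
```

With right padding, the pad tokens would sit between a short prompt and its continuation, so the same slice would no longer separate prompt from output cleanly.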
252
+
253
+
254
+ ## 6. License
255
+
256
+ The INF-34B series (including Base and Chat) supports commercial applications under a permissive [License](https://github.com/infly-ai/INF-LLM/blob/main/LICENSE).
257
+
258
+ ## 7. Contact
259
+
260
+ If you have any questions or are interested in cooperation, please contact us at [[email protected]](mailto:[email protected]).