JustinLin610 committed on
Commit ad78953 · 1 Parent(s): 33be903

Update README.md

  1. README.md +43 -43
README.md CHANGED
@@ -108,10 +108,10 @@ For more information, please refer to our [GitHub repo](https://github.com/QwenL
108
 
109
  We illustrate the zero-shot performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
110
 
111
- | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
112
- | ------------- | :--------: | :----------: | :----: | :--------: |
113
- | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
114
- | Int4 | 55.1 | 59.2 | 49.7 | 35.4 |
115
 
116
  ### Inference Speed
117
 
@@ -119,10 +119,10 @@ We illustrate the zero-shot performance of both BF16 and Int4 models on the benc
119
 
120
  We measured the average inference speed of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization level, respectively.
121
 
122
- | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
123
- | ------------- | :------------------:| :------------------:|
124
- | BF16 | 30.53 | 28.51 |
125
- | Int4 | 45.60 | 33.83 |
126
 
127
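  Specifically, we record the performance of generating 8192 tokens with a context of length 1. The evaluation runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.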
128
 
@@ -135,7 +135,7 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
135
  We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
136
 
137
  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
138
- | ------------------ | :---------------------------------: | :-----------------------------------: |
139
  | BF16 | 18.99GB | 24.40GB |
140
  | Int4 | 10.20GB | 15.61GB |
141
 
@@ -162,12 +162,12 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
162
  The details of the model architecture of Qwen-7B-Chat are listed as follows:
163
 
164
  | Hyperparameter | Value |
165
- | :------------- | :----: |
166
- | n_layers | 32 |
167
- | n_heads | 32 |
168
- | d_model | 4096 |
169
  | vocab size | 151851 |
170
- | sequence length | 8192 |
171
 
172
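  For position encoding, the FFN activation function, and normalization, we also adopt the currently most popular practices:
173
  RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm (flash-attention can optionally be installed for acceleration).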
@@ -204,7 +204,7 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
204
  We report the 0-shot & 5-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
205
 
206
  | Model | Avg. Acc. |
207
- |:--------------------------------:| :-------: |
208
  | LLaMA2-7B-Chat | 31.9 |
209
  | LLaMA2-13B-Chat | 36.2 |
210
  | LLaMA2-70B-Chat | 44.3 |
@@ -246,7 +246,7 @@ The 0-shot & 5-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
246
  Qwen-7B-Chat still ranks at the top among human-aligned models of comparable size.
247
 
248
  | Model | Avg. Acc. |
249
- |:--------------------------------:| :-------: |
250
  | ChatGLM2-6B-Chat | 46.0 |
251
  | LLaMA2-7B-Chat | 46.2 |
252
  | InternLM-7B-Chat | 51.1 |
@@ -266,18 +266,18 @@ Qwen-7B-Chat在[HumanEval](https://github.com/openai/human-eval)的zero-shot Pas
266
 
267
  The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is shown below:
268
 
269
- | Model | Pass@1 |
270
- |:-----------------------:| :-------: |
271
- | ChatGLM2-6B-Chat | 11.0 |
272
- | LLaMA2-7B-Chat | 12.2 |
273
- | InternLM-7B-Chat | 14.6 |
274
- | Baichuan2-7B-Chat | 13.4 |
275
- | LLaMA2-13B-Chat | 18.9 |
276
- | Baichuan2-13B-Chat | 17.7 |
277
- | LLaMA2-70B-Chat | 32.3 |
278
- | Qwen-7B-Chat (original) | 24.4 |
279
- | **Qwen-7B-Chat** | 37.2 |
280
- | **Qwen-14B-Chat** | **43.9** |
281
 
282
  ### Mathematics Evaluation
283
 
@@ -285,20 +285,20 @@ The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/hu
285
 
286
  The accuracy of Qwen-7B-Chat on GSM8K is shown below
287
 
288
- | Model | Acc. |
289
- |:--------------------------------:| :-------: |
290
- | LLaMA2-7B-Chat | 26.3 |
291
- | ChatGLM2-6B-Chat | 28.8 |
292
- | Baichuan2-7B-Chat | 32.8 |
293
- | InternLM-7B-Chat | 33.0 |
294
- | LLaMA2-13B-Chat | 37.1 |
295
- | Baichuan2-13B-Chat | 55.3 |
296
- | LLaMA2-70B-Chat | 59.3 |
297
- | Qwen-7B-Chat (original) (0-shot) | 41.1 |
298
- | **Qwen-7B-Chat (0-shot)** | 50.3 |
299
- | **Qwen-7B-Chat (8-shot)** | 54.1 |
300
- | **Qwen-14B-Chat (0-shot)** | **60.1** |
301
- | **Qwen-14B-Chat (8-shot)** | 59.3 |
302
 
303
  ### Long-Context Understanding
304
 
@@ -311,7 +311,7 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
311
  **(To use these tricks, please set `use_dynamic_ntk` and `use_long_attn` to true in config.json.)**
312
 
313
  | Model | VCSUM (zh) |
314
- | :---------------- | :--------: |
315
  | GPT-3.5-Turbo-16k | 16.0 |
316
  | LLama2-7B-Chat | 0.2 |
317
  | InternLM-7B-Chat | 13.0 |
 
108
 
109
  We illustrate the zero-shot performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
110
 
111
+ | Quantization | MMLU | CEval (val) | GSM8K | Humaneval |
112
+ |--------------|:----:|:-----------:|:-----:|:---------:|
113
+ | BF16 | 55.8 | 59.7 | 50.3 | 37.2 |
114
+ | Int4 | 55.1 | 59.2 | 49.7 | 35.4 |
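
To make the "no significant degradation" claim concrete, the following minimal sketch computes the absolute and relative drop from BF16 to Int4 for each benchmark. The scores are copied from the table above; the dictionary layout and the helper name `quantization_drop` are illustrative assumptions, not part of the Qwen codebase.

```python
# Zero-shot scores copied from the table above: (BF16, Int4), higher is better.
scores = {
    "MMLU": (55.8, 55.1),
    "CEval (val)": (59.7, 59.2),
    "GSM8K": (50.3, 49.7),
    "Humaneval": (37.2, 35.4),
}


def quantization_drop(bf16, int4):
    """Return (absolute drop in points, relative drop in percent of the BF16 score)."""
    absolute = bf16 - int4
    relative = 100.0 * absolute / bf16
    return absolute, relative


# Hypothetical helper structure: benchmark name -> (absolute drop, relative drop).
drops = {name: quantization_drop(*pair) for name, pair in scores.items()}

for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute:.1f} points ({relative:.1f}% relative)")
```

On these numbers the largest relative drop is on Humaneval and stays below 5% of the BF16 score.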
115
 
116
  ### Inference Speed
117
 
 
119
 
120
  We measured the average inference speed of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization level, respectively.
121
 
122
+ | Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
123
+ |--------------|:-------------------:|:-------------------:|
124
+ | BF16 | 30.53 | 28.51 |
125
+ | Int4 | 45.60 | 33.83 |
126
 
127
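  Specifically, we record the performance of generating 8192 tokens with a context of length 1. The evaluation runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.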
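
As a rough illustration of what the speed table implies, the sketch below computes the Int4 speedup over BF16 at each generation length. The figures are copied from the table; the table does not state a unit, so the ratio is used because it is unit-independent. The names `speeds` and `speedup` are assumptions made for this example.

```python
# Average generation speed copied from the table above (unit as in the table).
speeds = {
    2048: {"BF16": 30.53, "Int4": 45.60},
    8192: {"BF16": 28.51, "Int4": 33.83},
}


def speedup(bf16_speed, int4_speed):
    """Return how many times faster Int4 is than BF16 (ratio of speeds)."""
    return int4_speed / bf16_speed


# Generation length -> Int4 / BF16 speed ratio.
ratios = {length: speedup(v["BF16"], v["Int4"]) for length, v in speeds.items()}

for length, ratio in ratios.items():
    print(f"{length} tokens: Int4 is {ratio:.2f}x the BF16 speed")
```

The ratio is a little under 1.5x at 2048 tokens and a little under 1.2x at 8192 tokens, so the measured advantage of Int4 narrows for longer generations.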
128
 
 
135
  We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 or Int4 quantization level, respectively. The results are shown below.
136
 
137
  | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
138
+ |--------------------|:-----------------------------------:|:-------------------------------------:|
139
  | BF16 | 18.99GB | 24.40GB |
140
  | Int4 | 10.20GB | 15.61GB |
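
The memory figures can be turned into savings with a few lines of arithmetic. This is a minimal sketch using the values from the table above; the names `peak_gb` and `memory_saving` are hypothetical and only serve the example.

```python
# Peak GPU memory in GB, copied from the table above.
peak_gb = {
    "encode_2048": {"BF16": 18.99, "Int4": 10.20},
    "generate_8192": {"BF16": 24.40, "Int4": 15.61},
}


def memory_saving(bf16_gb, int4_gb):
    """Return (GB saved, percent saved) when moving from BF16 to Int4."""
    saved = bf16_gb - int4_gb
    return saved, 100.0 * saved / bf16_gb


# Setting name -> (absolute saving in GB, relative saving in percent).
savings = {name: memory_saving(v["BF16"], v["Int4"]) for name, v in peak_gb.items()}

for name, (saved, percent) in savings.items():
    print(f"{name}: {saved:.2f} GB saved ({percent:.1f}%)")
```

In both settings the absolute saving is 8.79 GB, which corresponds to roughly 46% of the BF16 peak when encoding 2048 tokens and roughly 36% when generating 8192 tokens.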
141
 
 
162
  The details of the model architecture of Qwen-7B-Chat are listed as follows:
163
 
164
  | Hyperparameter | Value |
165
+ |:----------------|:------:|
166
+ | n_layers | 32 |
167
+ | n_heads | 32 |
168
+ | d_model | 4096 |
169
  | vocab size | 151851 |
170
+ | sequence length | 8192 |
171
 
172
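  For position encoding, the FFN activation function, and normalization, we also adopt the currently most popular practices:
173
  RoPE relative position encoding, SwiGLU as the activation function, and RMSNorm (flash-attention can optionally be installed for acceleration).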
 
204
  We report the 0-shot & 5-shot accuracy of Qwen-7B-Chat on the C-Eval validation set:
205
 
206
  | Model | Avg. Acc. |
207
+ |:--------------------------------:|:---------:|
208
  | LLaMA2-7B-Chat | 31.9 |
209
  | LLaMA2-13B-Chat | 36.2 |
210
  | LLaMA2-70B-Chat | 44.3 |
 
246
  Qwen-7B-Chat still ranks at the top among human-aligned models of comparable size.
247
 
248
  | Model | Avg. Acc. |
249
+ |:--------------------------------:|:---------:|
250
  | ChatGLM2-6B-Chat | 46.0 |
251
  | LLaMA2-7B-Chat | 46.2 |
252
  | InternLM-7B-Chat | 51.1 |
 
266
 
267
  The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is shown below:
268
 
269
+ | Model | Pass@1 |
270
+ |:-----------------------:|:--------:|
271
+ | ChatGLM2-6B-Chat | 11.0 |
272
+ | LLaMA2-7B-Chat | 12.2 |
273
+ | InternLM-7B-Chat | 14.6 |
274
+ | Baichuan2-7B-Chat | 13.4 |
275
+ | LLaMA2-13B-Chat | 18.9 |
276
+ | Baichuan2-13B-Chat | 17.7 |
277
+ | LLaMA2-70B-Chat | 32.3 |
278
+ | Qwen-7B-Chat (original) | 24.4 |
279
+ | **Qwen-7B-Chat** | 37.2 |
280
+ | **Qwen-14B-Chat** | **43.9** |
281
 
282
  ### Mathematics Evaluation
283
 
 
285
 
286
  The accuracy of Qwen-7B-Chat on GSM8K is shown below
287
 
288
+ | Model | Acc. |
289
+ |:--------------------------------:|:--------:|
290
+ | LLaMA2-7B-Chat | 26.3 |
291
+ | ChatGLM2-6B-Chat | 28.8 |
292
+ | Baichuan2-7B-Chat | 32.8 |
293
+ | InternLM-7B-Chat | 33.0 |
294
+ | LLaMA2-13B-Chat | 37.1 |
295
+ | Baichuan2-13B-Chat | 55.3 |
296
+ | LLaMA2-70B-Chat | 59.3 |
297
+ | Qwen-7B-Chat (original) (0-shot) | 41.1 |
298
+ | **Qwen-7B-Chat (0-shot)** | 50.3 |
299
+ | **Qwen-7B-Chat (8-shot)** | 54.1 |
300
+ | **Qwen-14B-Chat (0-shot)** | **60.1** |
301
+ | **Qwen-14B-Chat (8-shot)** | 59.3 |
302
 
303
  ### Long-Context Understanding
304
 
 
311
  **(To use these tricks, please set `use_dynamic_ntk` and `use_long_attn` to true in config.json.)**
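
One possible way to flip the two flags named in the note above is to edit `config.json` programmatically. The sketch below uses only the Python standard library; the helper name `enable_long_context` and the example path are assumptions for illustration, while the two keys are the ones given in the note.

```python
import json
from pathlib import Path


def enable_long_context(config_path):
    """Set `use_dynamic_ntk` and `use_long_attn` to true in the given config.json.

    `config_path` is wherever your local copy of the model's config.json lives
    (hypothetical here). All other keys in the file are left unchanged.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["use_dynamic_ntk"] = True
    config["use_long_attn"] = True
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

A call such as `enable_long_context("path/to/model/config.json")` rewrites the file in place, so keeping a backup of the original config is a sensible precaution.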
312
 
313
  | Model | VCSUM (zh) |
314
+ |:------------------|:----------:|
315
  | GPT-3.5-Turbo-16k | 16.0 |
316
  | LLama2-7B-Chat | 0.2 |
317
  | InternLM-7B-Chat | 13.0 |