Speed Benchmark
===============

We benchmark training speed against
`LLaMA-Factory <https://github.com/hiyouga/LLaMA-Factory>`__.
The LLaMA-Factory commit used for the comparison is
`8e04794 <https://github.com/hiyouga/LLaMA-Factory/tree/8e04794b2da067a4123b9d7091a54c5647f44244>`__.
Speed is measured with the
`Alpaca <https://huggingface.co/datasets/tatsu-lab/alpaca>`__
dataset as the training data.

Hardware
--------

-  NVIDIA A100-SXM4-80GB GPUs

-  Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz

Software Environment
--------------------

-  Python 3.10

-  PyTorch 1.13

-  CUDA 11.7

-  CUDNN 8.5

-  NCCL 2.14.3

Speed
-----

|image1|

|image2|

|image3|

.. tip::
  TGS stands for Tokens per GPU per Second, i.e. the number of tokens trained per GPU per second.

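The tables below report throughput as TGS. As a rough, standalone illustration (this is
not code from XTuner or its logging), TGS can be computed from the total number of
tokens processed, the wall-clock training time, and the number of GPUs:

.. code:: python

  def tokens_per_gpu_per_second(total_tokens: int,
                                elapsed_seconds: float,
                                num_gpus: int) -> float:
      """TGS = tokens processed / (number of GPUs * wall-clock seconds)."""
      return total_tokens / (num_gpus * elapsed_seconds)

  # Illustrative numbers only: 8 GPUs processing 290M tokens in 12,000 s
  # gives roughly 3,000 tokens per GPU per second.
  print(tokens_per_gpu_per_second(290_000_000, 12_000.0, 8))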

.. list-table::
  :widths: 30 15 20 20 20 50
  :header-rows: 1

  * - Model
    - GPUs
    - Sequence Length
    - TGS
    - TFLOPs
    - Config
  * - Llama2-7B
    - 8
    - 8k
    - 3028.3
    - 185.3
    - `llama2_7b_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_8k_sp1.py>`_
  * - Llama2-7B
    - 8
    - 32k
    - 2234.2
    - 193.0
    - `llama2_7b_full_alpaca_enzh_32k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_32k_sp1.py>`_
  * - Llama2-7B
    - 8
    - 128k
    - 948.6
    - 180.3
    - `llama2_7b_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_128k_sp8.py>`_
  * - Llama2-7B
    - 8
    - 256k
    - 540.1
    - 176.9
    - `llama2_7b_full_alpaca_enzh_256k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_256k_sp8.py>`_
  * - Llama2-7B
    - 32
    - 1M
    - 133.6
    - 153.9
    - `llama2_7b_full_alpaca_enzh_1M_sp16.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_1M_sp16.py>`_

.. list-table::
  :widths: 30 15 20 20 20 50
  :header-rows: 1

  * - Model
    - GPUs
    - Sequence Length
    - TGS
    - TFLOPs
    - Config
  * - Yi-34B-200K
    - 32
    - 8k
    - 485.1
    - 165.6
    - `yi_34b_200k_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_8k_sp1.py>`_
  * - Yi-34B-200K
    - 32
    - 32k
    - 491.5
    - 209.1
    - `yi_34b_200k_full_alpaca_enzh_32k_sp2.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_32k_sp2.py>`_
  * - Yi-34B-200K
    - 32
    - 128k
    - 251.1
    - 191.8
    - `yi_34b_200k_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_128k_sp8.py>`_
  * - Yi-34B-200K
    - 32
    - 256k
    - 119.7
    - 145.3
    - `yi_34b_200k_full_alpaca_enzh_256k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_256k_sp8.py>`_

.. list-table::
  :widths: 30 15 20 20 20 50
  :header-rows: 1

  * - Model
    - GPUs
    - Sequence Length
    - TGS
    - TFLOPs
    - Config
  * - Llama2-70B
    - 32
    - 8k
    - 216.8
    - 144.7
    - `llama2_70b_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_8k_sp1.py>`_
  * - Llama2-70B
    - 32
    - 32k
    - 300.9
    - 239.6
    - `llama2_70b_full_alpaca_enzh_32k_sp4.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_32k_sp4.py>`_
  * - Llama2-70B
    - 32
    - 128k
    - 144.7
    - 189.7
    - `llama2_70b_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_128k_sp8.py>`_
  * - Llama2-70B
    - 32
    - 256k
    - 63.8
    - 127.6
    - `llama2_70b_full_alpaca_enzh_256k_sp16.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_256k_sp16.py>`_
  * - Llama2-70B
    - 64
    - 1M
    - 21.8
    - 133.5
    - `llama2_70b_full_alpaca_enzh_1M_sp64.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_1M_sp64.py>`_

.. note::
  All experiments pack the Alpaca dataset to the maximum sequence length. Because the
  Alpaca dataset contains relatively few tokens, it cannot be packed into very long
  sequences (such as 1M tokens) as-is, so for the longer sequence lengths the XTuner
  code is modified as follows:

  .. code:: diff

    # xtuner/dataset/huggingface.py
    def build_origin_dataset(dataset, split):
        ...
    +   # 6 times larger dataset (for speed testing purposes only)
    +   dataset = concatenate_datasets([dataset for _ in range(6)])
        return dataset

    def pack_dataset(dataset, max_length, use_varlen_attn, shuffle_before_pack,
                      map_num_proc):
        dataset = dataset.map(
            Packer(max_length, use_varlen_attn=use_varlen_attn),
            batched=True,
    -       num_proc=map_num_proc
    +       batch_size=25000,
    +       num_proc=1
        )
        return dataset


.. note::
  Because the Alpaca dataset is small, the first change enlarges the dataset 6x so that
  there are enough training iterations for a stable speed measurement. In addition,
  because each Alpaca sample is short, the second change is applied during packing so
  that each packing batch contains enough data to fill sequences of ``max_length``.
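
For reference, the ``Packer`` used in the diff above concatenates tokenized samples and
slices the resulting token stream into fixed-length chunks. The snippet below is a
minimal, self-contained sketch of that packing idea, not XTuner's actual ``Packer``
implementation (it ignores labels and the metadata needed for variable-length attention):

.. code:: python

  from typing import List

  def pack_to_max_length(tokenized_samples: List[List[int]],
                         max_length: int) -> List[List[int]]:
      """Concatenate tokenized samples and split the stream into max_length chunks.

      The trailing remainder that cannot fill a full chunk is dropped, which is
      why packing to very long sequences needs a sufficiently large dataset.
      """
      stream: List[int] = []
      for tokens in tokenized_samples:
          stream.extend(tokens)
      return [stream[i:i + max_length]
              for i in range(0, len(stream) - max_length + 1, max_length)]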

.. |image1| image:: https://github.com/InternLM/xtuner/assets/41630003/c9c05dbd-0806-4fb2-9da9-62f04b150f7c
.. |image2| image:: https://github.com/InternLM/xtuner/assets/41630003/3ef6308c-595b-4624-b56d-a8737a1f2261
.. |image3| image:: https://github.com/InternLM/xtuner/assets/41630003/ba16368e-e5f7-41eb-89ed-1140a8633134