Speed Benchmark
===============

We compared training speed with
`LLaMA-Factory <https://github.com/hiyouga/LLaMA-Factory>`__,
using LLaMA-Factory commit
`8e04794 <https://github.com/hiyouga/LLaMA-Factory/tree/8e04794b2da067a4123b9d7091a54c5647f44244>`__
and the
`Alpaca <https://huggingface.co/datasets/tatsu-lab/alpaca>`__
dataset as the training data for the speed tests.

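
For reference, the Alpaca dataset above can be loaded directly from the
Hugging Face Hub. A minimal sketch using the ``datasets`` library
(illustrative only; not part of the benchmark configs):

.. code:: python

   # Load the Alpaca dataset linked above from the Hugging Face Hub.
   from datasets import load_dataset

   alpaca = load_dataset("tatsu-lab/alpaca", split="train")
   print(len(alpaca), alpaca.column_names)
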

Hardware
--------

- NVIDIA A100-SXM4-80GB GPUs

- Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz


Software Environment
--------------------

- Python 3.10

- PyTorch 1.13

- CUDA 11.7

- CUDNN 8.5

- NCCL 2.14.3

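
The versions above can be checked locally with standard Python / PyTorch
introspection calls; a minimal sketch (no XTuner-specific APIs involved):

.. code:: python

   # Print the versions of the components listed above.
   import sys

   import torch

   print("Python:", sys.version.split()[0])
   print("PyTorch:", torch.__version__)
   print("CUDA:", torch.version.cuda)
   print("cuDNN:", torch.backends.cudnn.version())
   print("NCCL:", torch.cuda.nccl.version())
   if torch.cuda.is_available():
       print("GPU:", torch.cuda.get_device_name(0))
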

Speed
-----

|image1|

|image2|

|image3|

.. tip::
   TGS stands for Tokens per GPU per Second, i.e., the number of tokens
   trained per GPU per second.

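
As a rough illustration of the arithmetic behind the TGS and TFLOPs columns
in the tables below, here is a minimal sketch. All inputs are assumed example
values, and the FLOPs-per-token formula is the common
``6 * params + 12 * layers * hidden * seq_len`` estimate, which may not match
XTuner's exact accounting:

.. code:: python

   # Rough sketch of how TGS and an estimated TFLOPs-per-GPU can be derived.
   # All inputs below are assumed example values, not XTuner log output.

   def tokens_per_gpu_per_second(tokens_per_iter, seconds_per_iter, num_gpus):
       """TGS: tokens processed per second, averaged over all GPUs."""
       return tokens_per_iter / seconds_per_iter / num_gpus

   def estimated_tflops_per_gpu(tgs, num_params, num_layers, hidden_size, seq_len):
       """~6 FLOPs per parameter per token (forward + backward), plus
       12 * layers * hidden * seq_len per token for attention."""
       flops_per_token = 6 * num_params + 12 * num_layers * hidden_size * seq_len
       return tgs * flops_per_token / 1e12

   # e.g. 8 * 8192 tokens per iteration on 8 GPUs, 2.7 s per iteration
   tgs = tokens_per_gpu_per_second(8 * 8192, 2.7, 8)
   print(f"TGS ~ {tgs:.1f}")
   # The exact TFLOPs value depends on the parameter count and accounting
   # convention, so it will not match the tables exactly.
   print(f"TFLOPs ~ {estimated_tflops_per_gpu(tgs, 7e9, 32, 4096, 8192):.1f}")
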

.. list-table::
   :widths: 30 15 20 20 20 50
   :header-rows: 1

   * - Model
     - GPUs
     - Sequence Length
     - TGS
     - TFLOPs
     - Config
   * - Llama2-7B
     - 8
     - 8k
     - 3028.3
     - 185.3
     - `llama2_7b_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_8k_sp1.py>`_
   * - Llama2-7B
     - 8
     - 32k
     - 2234.2
     - 193.0
     - `llama2_7b_full_alpaca_enzh_32k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_32k_sp1.py>`_
   * - Llama2-7B
     - 8
     - 128k
     - 948.6
     - 180.3
     - `llama2_7b_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_128k_sp8.py>`_
   * - Llama2-7B
     - 8
     - 256k
     - 540.1
     - 176.9
     - `llama2_7b_full_alpaca_enzh_256k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_256k_sp8.py>`_
   * - Llama2-7B
     - 32
     - 1M
     - 133.6
     - 153.9
     - `llama2_7b_full_alpaca_enzh_1M_sp16.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_1M_sp16.py>`_


.. list-table::
   :widths: 30 15 20 20 20 50
   :header-rows: 1

   * - Model
     - GPUs
     - Sequence Length
     - TGS
     - TFLOPs
     - Config
   * - Yi-34B-200K
     - 32
     - 8k
     - 485.1
     - 165.6
     - `yi_34b_200k_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_8k_sp1.py>`_
   * - Yi-34B-200K
     - 32
     - 32k
     - 491.5
     - 209.1
     - `yi_34b_200k_full_alpaca_enzh_32k_sp2.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_32k_sp2.py>`_
   * - Yi-34B-200K
     - 32
     - 128k
     - 251.1
     - 191.8
     - `yi_34b_200k_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_128k_sp8.py>`_
   * - Yi-34B-200K
     - 32
     - 256k
     - 119.7
     - 145.3
     - `yi_34b_200k_full_alpaca_enzh_256k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_256k_sp8.py>`_


.. list-table::
   :widths: 30 15 20 20 20 50
   :header-rows: 1

   * - Model
     - GPUs
     - Sequence Length
     - TGS
     - TFLOPs
     - Config
   * - Llama2-70B
     - 32
     - 8k
     - 216.8
     - 144.7
     - `llama2_70b_full_alpaca_enzh_8k_sp1.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_8k_sp1.py>`_
   * - Llama2-70B
     - 32
     - 32k
     - 300.9
     - 239.6
     - `llama2_70b_full_alpaca_enzh_32k_sp4.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_32k_sp4.py>`_
   * - Llama2-70B
     - 32
     - 128k
     - 144.7
     - 189.7
     - `llama2_70b_full_alpaca_enzh_128k_sp8.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_128k_sp8.py>`_
   * - Llama2-70B
     - 32
     - 256k
     - 63.8
     - 127.6
     - `llama2_70b_full_alpaca_enzh_256k_sp16.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_256k_sp16.py>`_
   * - Llama2-70B
     - 64
     - 1M
     - 21.8
     - 133.5
     - `llama2_70b_full_alpaca_enzh_1M_sp64.py <https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_1M_sp64.py>`_


.. note::
   In all experiments, the Alpaca data is packed (concatenated) up to the
   maximum sequence length. Because the Alpaca dataset contains relatively
   few tokens, it cannot be packed into extremely long sequences (e.g., a
   1M-token length). Therefore, for the longer sequence lengths, the XTuner
   code is modified as follows:


.. code:: diff

    def build_origin_dataset(dataset, split):
        ...
   +
   +    dataset = concatenate_datasets([dataset for _ in range(6)])
        return dataset

    def pack_dataset(dataset, max_length, use_varlen_attn, shuffle_before_pack,
                     map_num_proc):
        dataset = dataset.map(
            Packer(max_length, use_varlen_attn=use_varlen_attn),
            batched=True,
   -        num_proc=map_num_proc
   +        batch_size=25000,
   +        num_proc=1
        )
        return dataset


.. note::
   Because the Alpaca dataset is small, the first modification enlarges it
   6x so that there are enough training iterations for the speed
   measurements to be stable. In addition, because each Alpaca sample is
   short, the second modification ensures that, during packing, each batch
   contains enough data to be concatenated up to the full ``max_length``.

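
To make the intent of these two modifications concrete, here is a small,
self-contained sketch of the same idea built on the Hugging Face ``datasets``
API. ``concatenate_datasets`` and ``Dataset.map`` are the library calls used
above; ``pack_into_blocks`` is a simplified stand-in for XTuner's ``Packer``
and is not the actual implementation:

.. code:: python

   # Simplified illustration of the two changes above: (1) repeat a small
   # dataset 6x so there are enough training iterations, and (2) pack many
   # short samples into fixed-length blocks within one large map() batch.
   from datasets import Dataset, concatenate_datasets

   def repeat_dataset(dataset, times=6):
       # Modification 1: enlarge the dataset by simple repetition.
       return concatenate_datasets([dataset for _ in range(times)])

   def pack_into_blocks(batch, max_length=8):
       # Modification 2 (simplified): concatenate all token ids in the batch,
       # then split them into blocks of `max_length` tokens (toy Packer).
       all_ids = [tok for ids in batch["input_ids"] for tok in ids]
       blocks = [all_ids[i:i + max_length]
                 for i in range(0, len(all_ids) - max_length + 1, max_length)]
       return {"input_ids": blocks}

   toy = Dataset.from_dict({"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]})
   packed = repeat_dataset(toy, times=6).map(
       pack_into_blocks,
       batched=True,
       batch_size=25000,  # large batch so packing sees (almost) all samples at once
       num_proc=1,        # single process, mirroring the diff above
       remove_columns=["input_ids"],
   )
   print(packed["input_ids"])  # six blocks of eight token ids each
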

.. |image1| image:: https://github.com/InternLM/xtuner/assets/41630003/c9c05dbd-0806-4fb2-9da9-62f04b150f7c
.. |image2| image:: https://github.com/InternLM/xtuner/assets/41630003/3ef6308c-595b-4624-b56d-a8737a1f2261
.. |image3| image:: https://github.com/InternLM/xtuner/assets/41630003/ba16368e-e5f7-41eb-89ed-1140a8633134