- LLM/LVM is Data-hungry
- Streaming Data Flow
- Scaling Exact Attention
Modeling on Internet-scale Data
Bingyi Jing@ML-Summit
Apr 25th, 2024
LLM/LVM is Data-hungry
For the same text input, different tasks require different labeled data and different models.
You
What sentiment does "I went to the National Grand Theater today and watched a wonderful performance" express?
Positive
You
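Extract the time, place, people, and events mentioned in the following text: "I went to the National Grand Theater today and watched a wonderful performance."
In the text you provided: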
You
Translate "我今天去国家大剧院看了一场精彩的演出" into English.
“I went to the National Grand Theater today and watched a wonderful performance.”
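Traditional NLP: datasets are hard to obtain and limited in size.
LLM: any text can be used as training data.
Traditional NLP: knowledge cannot be shared between different models.
LLM: only one model is needed.
Traditional NLP: unlabeled data is plentiful but hard to put to use.
LLM: data needs no labels; the model trains directly on raw documents.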
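The claim that raw documents need no labels can be made concrete with a minimal sketch of next-token training pairs. The whitespace tokenizer and the `next_token_pairs` helper are illustrative assumptions, not part of the talk; real LLMs use subword tokenizers and much longer contexts.

```python
def next_token_pairs(document: str, context_size: int = 4):
    """Yield (context, target) pairs from raw text; no human labels involved."""
    tokens = document.split()  # assumption: whitespace tokenization, for illustration only
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        yield context, tokens[i]  # the "label" is simply the next token


pairs = list(next_token_pairs("I went to the theater today"))
for context, target in pairs:
    print(context, "->", target)
```

Every position in the document yields one training example, which is why any text can serve as pre-training data.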
The official datasets hosted on Hugging Face as of April 2024, categorized into a tree diagram by task type,
compared with the data used to pre-train GPT-3.
Modern large language models need far more pre-training data than traditional NLP.
Training GPT-3 used roughly 0.75 TB of text data.
By today's standards, that is not a large amount of training data.
How much data does Sora use?
We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data 1
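A useful point of comparison: roughly 500 hours of video are uploaded to YouTube every minute, so the video uploaded over the past five years totals about 1.3 billion hours = 78.8 billion minutes. Because training a diffusion model for text-to-video requires high-quality annotated video, we can estimate that Sora was trained on the order of 100 million minutes of video.
A fairly accurate current estimate is that one minute of video corresponds to roughly 1M tokens. 2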
~500 TB of training data
~500 PB of raw data
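The YouTube comparison is back-of-envelope arithmetic, and it can be checked in a few lines. The constants below are the slide's own estimates; the token total is simply their product and should be read as an order-of-magnitude figure, not a reported number.

```python
# Figures taken from the slide; everything else is plain arithmetic.
HOURS_UPLOADED_PER_MINUTE = 500      # hours of video uploaded to YouTube per minute
YEARS = 5
TOKENS_PER_VIDEO_MINUTE = 1_000_000  # slide's estimate: 1 minute of video ~ 1M tokens
SORA_TRAINING_MINUTES = 100_000_000  # slide's estimate: ~100 million minutes

minutes_per_year = 60 * 24 * 365
youtube_hours = HOURS_UPLOADED_PER_MINUTE * minutes_per_year * YEARS
youtube_minutes = youtube_hours * 60
sora_tokens = SORA_TRAINING_MINUTES * TOKENS_PER_VIDEO_MINUTE

print(f"YouTube, 5 years: {youtube_hours / 1e9:.2f} billion hours")
print(f"                  {youtube_minutes / 1e9:.1f} billion minutes")
print(f"Sora estimate:    {sora_tokens:.0e} tokens")
print(f"Share of YouTube: {SORA_TRAINING_MINUTES / youtube_minutes:.2%}")
```

Under these assumptions the estimated training set is a small fraction of what YouTube alone received in five years, yet it still amounts to about 10^14 tokens.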
Training on internet-scale data
Modeling ultra-long sequences
Streaming Data Flow
The traditional approach to training downloads all the data to local storage up front and only then processes it.
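import datasets
from transformers import AutoTokenizer

# 1. Download the entire dataset to local disk.
dataset = datasets.load_dataset(
    "rotten_tomatoes",
    split="train",
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# 2. Tokenize the whole dataset before training starts.
dataset = dataset.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
)
# 3. Training runs only after the steps above have finished.
...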
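In this paradigm, ETL and model training run strictly in series, which is simple and easy to follow.
ETL pipelines for multimodal large models are becoming increasingly complex.
For copyright and storage reasons, multimodal data is mostly distributed as download links, so the rate at which it can be fetched is limited.
WebVid is distributed as URLs, with 10,730,233 entries in total.
The processing pipeline is complex and time-consuming, and its cost can even exceed that of training.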
An aerial video sweeps over a majestic ocean cliff with striated rock formations in rich hues of red, green, and orange. The sun’s rays enhance the colorful palette of the landscape, while the sea’s azure waters gently crash against the cliff’s base. Visible are the textured details of the cliff face and the contrast of the green algae and seagrass coating parts of the rock. Seabirds can be seen flying close to the water around the rocky outcrop. The video conveys a serene yet dynamic coastal scenery, highlighting the natural beauty and geological diversity of a rugged coastline.
🌋 LLaVA-1.6-Yi-34B (3s/it)
A breathtaking aerial view of a rocky cliff jutting out into the ocean. The cliff, adorned with vibrant green moss and patches of yellow and red lichen, is bathed in the warm glow of the sun. The ocean, a deep blue, is dotted with small white waves that crash against the cliff’s base. The perspective of the video is from above, providing a bird’s eye view of the cliff and the surrounding ocean. The sun’s rays illuminate the scene, casting a beautiful light on the cliff and the ocean. The video is a stunning representation of the raw beauty of nature.
Storage plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. 1
How to train on internet-scale data?
Just train on the internet!
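Streaming the data solves these problems.
But streaming data transfer is only the beginning; we need a training framework built entirely on streaming data.
Use S3 as the storage backend for data and weights, so that migrating between clouds of different scales is seamless.
Introduce a DPU cluster that lets data be transferred directly to the GPUs, eliminating the overhead of an in-memory database.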
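One way to picture streaming-based training is a producer that keeps fetching and preprocessing samples while the consumer (the training loop) is already running. The sketch below uses only the Python standard library; `stream_samples`, `fetch`, `process`, and the bucket name are hypothetical stand-ins, not the framework described in the talk.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stream_samples(urls, fetch, process, prefetch=8):
    """Yield processed samples while a background thread keeps fetching.

    `fetch` and `process` are hypothetical callables standing in for the
    download and preprocessing steps; `prefetch` bounds memory use.
    """
    buffer = queue.Queue(maxsize=prefetch)

    def producer():
        try:
            for url in urls:
                buffer.put(process(fetch(url)))  # blocks when the buffer is full
        finally:
            buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Toy stand-ins: no network access, just string manipulation.
urls = [f"s3://hypothetical-bucket/clip-{i}.mp4" for i in range(5)]
samples = list(
    stream_samples(urls, fetch=lambda u: u.rsplit("/", 1)[-1], process=str.upper)
)
print(samples)
```

The bounded queue is the key design choice: ETL never runs more than `prefetch` samples ahead of training, so nothing has to be downloaded in full beforehand. The Hugging Face `datasets` library offers a comparable mode through `load_dataset(..., streaming=True)`, which returns an iterable dataset instead of downloading everything first.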
Scaling Exact Attention
Framework | Flash-Attn-2 | FP8 (H100) | 3D Parallel + Zero | Padding Free | Fused Kernel | Static Graph | TGS¹ |
---|---|---|---|---|---|---|---|
Platformers | ✔️ | ✔️ | ✔️ | ✔️ | 100% | ✔️ | 3743 |
Megatron-LM | ✖️ | ✔️ | ✔️ | ✖️ | 80% | ✖️ | 3581 |
DeepSpeed | ✔️ | ✖️ | ✔️ | ✖️ | 60% | ✖️ | ✖️ |
Colossal-AI | ✖️ | ✖️ | ✔️ | ✖️ | 40% | ✖️ | 2610 |
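Flash-Attn-2 in the table is a fused GPU kernel. The pure-Python sketch below is not that kernel; it only illustrates the online-softmax rescaling that lets attention be computed one block of keys at a time while remaining exact. The function names and toy inputs are assumptions made for illustration.

```python
import math


def _score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    scores = [_score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, but keys and values are visited one block at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, -0.2]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(full, tiled)))
```

Because only a running maximum, a running sum, and one accumulator are kept, the full score matrix never has to be materialized, which is what makes exact attention practical on very long sequences.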
Fireworks exploding in the sky
Waves crashing against the shore
A bustling street in London with red telephone booths and Big Ben in the background
Camera pans left to right on mango slices sitting on a table
Two balls thrown in the air
Slow motion flower petals falling on the ground
A burning campfire in a forest
A boat sailing on a stormy ocean
Thanks