Modeling on Internet-scale Data

Bingyi Jing@ML-Summit

Apr 25th, 2024


  • LLM/LVM is Data-hungry
  • Streaming Data Flow
  • Scaling Exact Attention

LLM/LVM is Data-hungry

Revisiting the Pre-GPT Era

For the same text input, different tasks used to require different annotated datasets and different models.

  • Sentiment analysis (IMDB: 100k rows, 84.1MB)

flowchart LR
input["我今天去国家大剧院看了一场精彩的演出"]
label["积极 (positive)"]
input --> label

  • Named entity recognition

flowchart TD
A["我"]
B["今天去"]
C["国家大剧院"]
D["看了一场"]
E["精彩的演出"]
A-->AN["Person"]
B-->BN["Time"]
C-->CN["Location"]
E-->EN["Event"]

  • Text translation (wmt19: 15M rows, 1.25GB)

flowchart TD
source["我今天去国家大剧院看了一场精彩的演出"]
target["I went to the National Grand Theater today and watched a wonderful performance."]
source <--> target

Unified Task Paradigm


 You

What sentiment does “我今天去国家大剧院看了一场精彩的演出” express?

ChatGPT

积极 (Positive)

Unified Task Paradigm


 You

Please extract the time, location, task, and time of occurrence from the following text: “我今天去国家大剧院看了一场精彩的演出”.

ChatGPT

In the text you provided:

  • Time: 今天 (today)
  • Location: 国家大剧院 (the National Grand Theater)
  • Task: 看了一场精彩的演出 (watched a wonderful performance)
  • Time of occurrence: not stated explicitly, but “今天” implies it happened today.

Unified Task Paradigm


 You

Translate “我今天去国家大剧院看了一场精彩的演出” into English.

ChatGPT

“I went to the National Grand Theater today and watched a wonderful performance.”

Unified Task Paradigm

  • Datasets were hard to obtain and limited in size → any document can be used as training data.
  • Knowledge could not be shared between models → a single model is all that is needed.
  • Unlabeled data was plentiful but hard to exploit → no annotation is needed; we can train directly on raw documents (see the sketch below).
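As a minimal sketch of this shift (the model name is a placeholder for any instruction-tuned LLM, not the model used in the talk), a single generative model can cover all three tasks above, selected purely by the prompt:

```python
from transformers import pipeline

# Placeholder: substitute any instruction-tuned chat model available to you.
generator = pipeline("text-generation", model="an-instruct-model")

text = "我今天去国家大剧院看了一场精彩的演出"
prompts = [
    f"What sentiment does “{text}” express?",              # sentiment analysis
    f"Extract the time, location and event in “{text}”.",  # information extraction
    f"Translate “{text}” into English.",                   # translation
]
for p in prompts:
    # One model, three tasks: only the instruction changes.
    print(generator(p, max_new_tokens=64)[0]["generated_text"])
```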

Pretrained models are data-hungry

The official datasets hosted on Hugging Face as of April 2024, categorized into a tree diagram by task type,
compared with the data used to pre-train GPT-3.

Modern large language models need far more pre-training data than traditional NLP ever did.

Pretrained models are data-hungry

Training GPT-3 used roughly 0.75TB of text:

  • CommonCrawl 570GB
  • WebText 50GB
  • Wikipedia 11GB
  • Books 21GB
  • Academic Journals 101GB

Pretrained models are data-hungry

Training GPT-3 used roughly 0.75TB of text.

By today's standards, that is not a large amount of training data.


Dawning of the World Model Era

How much data does Sora use?

“We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.” [1]

For comparison: about 500 hours of video are uploaded to YouTube every minute, so the last five years of YouTube uploads amount to roughly 1.3 billion hours ≈ 78.8 billion minutes. Since training a text-to-video diffusion model needs high-quality captioned video, we can estimate that Sora was trained on on the order of 100 million minutes of video.

A reasonably solid current estimate puts one minute of video at about 1M tokens. [2]
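As a sanity check, here is the arithmetic behind these magnitudes (every constant is the slide's own order-of-magnitude figure, not a measurement):

```python
# Back-of-the-envelope estimate; all inputs are assumptions quoted from the slide.
upload_rate_h = 500                                # hours uploaded to YouTube per real-time minute
total_hours = upload_rate_h * 60 * 24 * 365 * 5    # five years of uploads
total_minutes = total_hours * 60
print(f"{total_hours:.2e} h, {total_minutes:.2e} min")  # ~1.31e9 h ≈ 7.9e10 min

trained_minutes = 1e8                              # assumed high-quality subset used for training
trained_tokens = trained_minutes * 1e6             # ~1M tokens per minute of video
print(f"{trained_tokens:.1e} tokens")              # ~1e14 tokens
```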

Dawning of the World Model Era


~500TB trained data

~500PB raw data

Dawning of the World Model Era

Challenge

  • Training on internet-scale data
  • Modeling ultra-long sequences

Streaming Data Flow

Legacy training paradigm

The traditional approach downloads the dataset to local storage in one go, then processes it.


import datasets
from transformers import AutoTokenizer

# (1) Download the dataset to local storage
dataset = datasets.load_dataset(
    "rotten_tomatoes",
    split="train",
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# (2) Process the dataset into model inputs; results are cached to local disk
dataset = dataset.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
)

# (3) Ready to train
...

Legacy training paradigm

In this paradigm, ETL and model training are fully serial: simple and easy to reason about.

What’s the Problem?

The ETL pipeline of multimodal large models is becoming more and more complex:

  • E(xtract): many modalities and complex sources; fetching takes a long time
  • T(ransform): complex processing pipelines
  • L(oad): heavy storage footprint

What’s the Problem?

For copyright and storage reasons, most multimodal data is distributed as download URLs, which caps the rate at which it can be fetched.

WebVid is distributed as URLs and contains 10,730,233 entries.

What’s the Problem?

The processing pipeline is complex and slow; its cost can even exceed that of training itself.

GPT-4V (20s/it)

An aerial video sweeps over a majestic ocean cliff with striated rock formations in rich hues of red, green, and orange. The sun's rays enhance the colorful palette of the landscape, while the sea's azure waters gently crash against the cliff's base. Visible are the textured details of the cliff face and the contrast of the green algae and seagrass coating parts of the rock. Seabirds can be seen flying close to the water around the rocky outcrop. The video conveys a serene yet dynamic coastal scenery, highlighting the natural beauty and geological diversity of a rugged coastline.

🌋 LLaVA-1.6-Yi-34B (3s/it)

A breathtaking aerial view of a rocky cliff jutting out into the ocean. The cliff, adorned with vibrant green moss and patches of yellow and red lichen, is bathed in the warm glow of the sun. The ocean, a deep blue, is dotted with small white waves that crash against the cliff's base. The perspective of the video is from above, providing a bird's eye view of the cliff and the surrounding ocean. The sun's rays illuminate the scene, casting a beautiful light on the cliff and the ocean. The video is a stunning representation of the raw beauty of nature.

What’s the Problem?

“Storage plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly.” [1]

  • Filtering 100 million minutes of video out of raw footage could mean tens of PB of raw data or more.
  • A typical small data center has no way to build storage infrastructure suited to video pre-training.

What’s the Problem?

  • Complex data sources
  • Sources that cannot be fetched immediately
  • Complex processing pipelines
  • Data processing coupled to model training
  • Too much data to process in one pass

What’s the Problem?

  • The data flow is drifting ever further from model training.
  • If data keeps being processed the traditional way, the data flow will become the bottleneck that blocks training.

How to train on internet-scale data?

Just train on the internet!

Streaming to the rescue

  • Streaming the data solves these problems (a minimal sketch follows below).

  • But streaming transfer is only the beginning: we need a training framework built entirely around streaming data.
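Reusing the earlier rotten_tomatoes example: with `streaming=True`, Hugging Face `datasets` yields samples lazily over the network instead of downloading and caching the corpus first.

```python
import datasets
from transformers import AutoTokenizer

# streaming=True returns an IterableDataset: no up-front download, no local cache.
dataset = datasets.load_dataset(
    "rotten_tomatoes",
    split="train",
    streaming=True,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# map() is applied lazily, batch by batch, as the data streams past.
dataset = dataset.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
)

for example in dataset:  # download + tokenization happen here, during iteration
    ...  # feed the example to the training step
```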


Training on the internet

Use S3 as the storage backend for both data and weights, making cloud migration at any scale seamless.
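A minimal sketch of the idea, assuming the fsspec/s3fs stack and a hypothetical bucket name (the talk's actual storage layer is not shown): checkpoints stream straight to and from object storage, so switching clouds only changes the URL.

```python
import fsspec  # with the s3fs plugin installed, "s3://" URLs just work
import torch

CKPT = "s3://my-train-bucket/ckpt/step_1000.pt"  # hypothetical bucket/key

def save_checkpoint(model, url=CKPT):
    # Stream the state dict directly to object storage; no local disk staging.
    with fsspec.open(url, "wb") as f:
        torch.save(model.state_dict(), f)

def load_checkpoint(model, url=CKPT):
    # Works unchanged against any fsspec backend (s3://, gs://, file://, ...),
    # which is what makes cloud migration seamless.
    with fsspec.open(url, "rb") as f:
        model.load_state_dict(torch.load(f, map_location="cpu"))
```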

Training on the internet

Introduce a DPU cluster so that data can be transferred directly to the GPUs, eliminating the overhead of an in-memory database.



Training on the internet

  • Further decouples data processing from model training
  • Makes ETL fully parallel with model training (see the sketch below)
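As a hedged PyTorch sketch of what "fully parallel" means here (fetch_and_tokenize is a hypothetical helper standing in for the whole ETL pipeline): DataLoader worker processes fetch and transform upcoming shards while the GPU consumes the current batch.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class StreamingShardDataset(IterableDataset):
    """Streams samples shard by shard; each DataLoader worker takes its own
    slice of shards, so ETL runs in parallel with the training loop."""

    def __init__(self, shard_urls):
        self.shard_urls = shard_urls

    def __iter__(self):
        info = get_worker_info()
        # Partition shards across workers so no sample is duplicated.
        shards = (self.shard_urls if info is None
                  else self.shard_urls[info.id::info.num_workers])
        for url in shards:
            # Hypothetical helper: fetch + transform, yielding fixed-length
            # token blocks so the default collate function can batch them.
            yield from fetch_and_tokenize(url)

shard_urls = ["s3://bucket/shard-0000.tar", "s3://bucket/shard-0001.tar"]  # placeholders
loader = DataLoader(StreamingShardDataset(shard_urls), batch_size=8, num_workers=4)
for batch in loader:
    ...  # the GPU trains on this batch while workers prepare the next ones
```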

Scaling Exact Attention

Efficient distributed training infra

| Framework   | Flash-Attn-2 | FP8 (H100) | 3D Parallel + Zero | Padding Free | Fused Kernel | Static Graph | TGS [1] |
|-------------|--------------|------------|--------------------|--------------|--------------|--------------|---------|
| Platformers | ✔️           | ✔️         | ✔️                 | ✔️           | 100%         | ✔️           | 3743    |
| Megatron-LM | ✖️           | ✔️         | ✔️                 | ✖️           | 80%          | ✖️           | 3581    |
| Deepspeed   | ✔️           | ✖️         | ✔️                 | ✖️           | 60%          | ✖️           | ✖️      |
| Colossal-ai | ✖️           | ✖️         | ✔️                 | ✖️           | 40%          | ✖️           | 2610    |

Scaling exact attention to ultra long sequence
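To make the idea concrete, here is a toy single-head NumPy sketch of blockwise exact attention with an online softmax, the principle behind FlashAttention- and Ring-Attention-style systems (an illustration, not the talk's actual kernel): the full seq × seq score matrix is never materialized, so memory stays linear in sequence length while the result remains exact.

```python
import numpy as np

def blockwise_attention(q, k, v, block=256):
    """Exact (not approximate) attention, computed block by block over K/V."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    m = np.full(q.shape[0], -np.inf)  # running row-wise max of scores
    l = np.zeros(q.shape[0])          # running softmax normalizer
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                    # scores against this block only
        m_new = np.maximum(m, s.max(axis=-1))
        alpha = np.exp(m - m_new)                 # rescale previous partial sums
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        out = out * alpha[:, None] + p @ vb
        m = m_new
    return out / l[:, None]

# Matches naive attention to numerical precision, at a fraction of the memory.
q, k, v = (np.random.randn(1024, 64), np.random.randn(4096, 64),
           np.random.randn(4096, 64))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(-1, keepdims=True))
naive = (weights / weights.sum(-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), naive)
```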


Fireworks exploding in the sky

Waves crashing against the shore

A bustling street in London with red telephone booths and Big Ben in the background

Camera pans left to right on mango slices sitting on a table

Two balls thrown in the air

Slow motion flower petals falling on the ground

A burning campfire in a forest

A boat sailing on a stormy ocean

Text to Video. Our model generates videos based on text prompts, autoregressively.

Thanks

WeChat

E-mail