---
strip-comments: true
bibliography: ["ref.bib"]
format:
  revealjs:
    logo: ./figures/logo/sustech.png
    # footer: |
    #   <p align="right">
    #   <img src="./figures/logo/sustech.png?sanitize=true" height="60px">
    #   </p>
    slide-number: true
    multiplex: false
    show-notes: false
    theme: sustech.scss
    show-slide-number: all
    controls: false
    preview-links: true
    transition: "slide"
    preload-iframes: true
    view-distance: 10
    width: 1280
    height: 720
    mermaid:
      theme: dark
    code-overflow: wrap
    callout-icon: false
execute:
  echo: false
revealjs-plugins:
  - verticator
  - codewindow
  - qrcode
---
## {.theme-title .center}
::: {.titlebox style="text-align:center; font-size: 2em;"}
[Modeling on Internet-scale Data]{.adlery style="color:#320005;"}
[Bingyi Jing@ML-Summit]{style="font-size:0.5em;"}
<!-- [Songxin Zhang@CHUKSZ]{style="font-size:0.5em;"} -->
[Apr 25th, 2024]{style="font-size:0.5em;"}
:::
## {.theme-content}
:::: columns
::: {.column width="30%"}
:::
::: {.column width="70%"}
<br>
::: {.titlebox style="font-size: 1.5em;"}
- LLM/LVM is Data-hungry
:::
::: {.titlebox style="font-size: 1.5em;"}
- Streaming Data Flow
:::
::: {.titlebox style="font-size: 1.5em;"}
- Scaling <span>Exact</span> Attention
:::
:::
::::
::: {.notes}
- New challenges facing multimodal large models
- How to process and train on massive, internet-scale data
- How to model ultra-long sequences
:::
# {.theme-section}
::: {.title}
<b>
LLM/LVM is Data-<span>hungry</span>
</b>
:::
## Revisiting the Pre-GPT Era
<!-- ::: {.incremental} -->
For the same text input, each task required its own labeled dataset and its own model.
::: columns
::: {.column width="50%"}
- Sentiment analysis ([IMDB](https://huggingface.co/datasets/imdb): 100k rows, 84.1MB)
```{mermaid}
flowchart LR
markdown["我今天去国家大剧院看了一场精彩的演出"]
newLines["Positive"]
markdown --> newLines
```
- Named entity recognition ([CoNLL-2003](https://huggingface.co/datasets/jnlpba): 20k rows, 15MB)
```{mermaid}
flowchart TD
A["我"]
B["今天去"]
C["国家大剧院"]
D["看了一场"]
E["精彩的演出"]
A-->AN[Person]
B-->BN[Time]
C-->CN["Location"]
E-->EN["Event"]
```
:::
::: {.column width="50%"}
- Machine translation ([wmt19](https://huggingface.co/datasets/wmt19/viewer/cs-en/train): 15M rows, 1.25GB)
```{mermaid}
flowchart TD
markdown["我今天去国家大剧院看了一场精彩的演出"]
newLines["
I went to the National Grand Theater today and watched a wonderful performance."]
markdown <--> newLines
```
- ...
:::
:::
## Unified Task Paradigm {auto-animate=true}
<br>
{{< fa solid user >}} **You**
<!-- <a style="display: flex; align-items: center;"> -->
<!-- <img src="https://cdn3.emoji.gg/emojis/6473-user-icon.png" width="44px" height="44px" alt="ChatGPT Logo" style="margin-right: 10px;"> -->
<!-- <span style="color: white; font-weight: bold;">ChatGPT</span> -->
<!-- </a> -->
::: {style="margin-left: 54px;"}
What sentiment does “我今天去国家大剧院看了一场精彩的演出” express?
:::
{{< include chatgpt.qmd >}}
::: {style="margin-left: 54px;"}
Positive
:::
## Unified Task Paradigm {auto-animate=true}
<br>
{{< fa solid user >}} **You**
::: {style="margin-left: 54px;"}
Extract the time, place, task, and time of occurrence from the following text: “我今天去国家大剧院看了一场精彩的演出”.
:::
{{< include chatgpt.qmd >}}
::: {style="margin-left: 54px;font-size: 0.8em;"}
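In the text you provided:
- Time: 今天 (today)
- Place: 国家大剧院 (National Grand Theater)
- Task: 看了一场精彩的演出 (watched a wonderful performance)
- Time of occurrence: not stated explicitly, but “今天” implies that it happened today.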
:::
## Unified Task Paradigm {auto-animate=true}
<br>
{{< fa solid user >}} **You**
::: {style="margin-left: 54px;"}
Translate “我今天去国家大剧院看了一场精彩的演出” into English.
:::
{{< include chatgpt.qmd >}}
::: {style="margin-left: 54px;"}
"I went to the National Grand Theater today and watched a wonderful performance."
:::
## Unified Task Paradigm {auto-animate=true}
::: columns
::: {.column width="50%"}
![](./figures/causal_modeling.svg)
:::
::: {.column width="50%"}
![](./figures/masked_modeling.svg)
:::
:::
::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Datasets are hard to obtain and limited in size
:::
:::
::: {.column width="50%"}
::: {.fragment}
Any text can serve as training data
:::
:::
:::
::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Knowledge cannot be shared across models
:::
:::
::: {.column width="50%"}
::: {.fragment}
A single model is enough
:::
:::
:::
::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Unlabeled data is abundant but hard to put to use
:::
:::
::: {.column width="50%"}
::: {.fragment}
No labels are needed: the model trains directly on raw documents
:::
:::
:::
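::: {.notes}
A minimal sketch of why unlabeled text is enough for causal language modeling: every prefix of a document supplies its own label, namely the next token. The whitespace tokenizer and the name `next_token_pairs` are illustrative assumptions, not part of any real training stack.

```python
from typing import List, Tuple


def next_token_pairs(text: str) -> List[Tuple[List[str], str]]:
    """Turn raw text into (context, target) pairs for causal LM training.

    Assumption: a toy whitespace tokenizer stands in for a real subword tokenizer.
    """
    tokens = text.split()
    # Position i yields the prefix tokens[:i] as input and tokens[i] as the label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("I watched a wonderful performance today")
for context, target in pairs:
    print(context, "->", target)
```

No annotation step appears anywhere: the targets come from the text itself.
:::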
## Pretrained models are data-hungry {auto-animate=true}
<!-- <iframe class="slide-deck" src="nlp.html"></iframe> -->
```{=html}
{{< include components/nlp.qmd >}}
```
::: {style="text-align:center; font-size: 0.4em;"}
The official datasets hosted on Hugging Face as of April 2024, categorized into a tree diagram by task type,<br> compared with the data used to pre-train GPT-3.
:::
Modern large language models need far more pre-training data than traditional NLP ever used.
## Pretrained models are data-hungry {auto-animate=true}
Training GPT-3 used roughly 0.75 TB of text data
- {{<fa solid spider>}} CommonCrawl 570GB
- <i class="fa-brands fa-reddit" style="color: #FF4500;"></i> WebText 50GB
- {{<fa brands wikipedia-w>}} Wikipedia 11GB
- {{<fa solid book>}} Books 21GB
- <i class="fa-solid fa-book-journal-whills"></i> Academic Journals 101GB
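::: {.notes}
A quick check of the 0.75 TB figure, summing the per-source sizes listed on this slide. Decimal units (1 TB = 1000 GB) are assumed.

```python
# Dataset sizes in GB, as listed on the slide.
gpt3_sources_gb = {
    "CommonCrawl": 570,
    "WebText": 50,
    "Wikipedia": 11,
    "Books": 21,
    "Academic Journals": 101,
}

total_gb = sum(gpt3_sources_gb.values())
total_tb = total_gb / 1000  # decimal units: 1 TB = 1000 GB

print(f"{total_gb} GB = {total_tb:.2f} TB")
```
:::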
## Pretrained models are data-hungry {auto-animate=true}
Training GPT-3 used roughly 0.75 TB of text data
By today's standards, that is not a lot
```{=html}
{{< include components/gpt.qmd >}}
```
## Pretrained models are data-hungry {auto-animate=true}
Training GPT-3 used roughly 0.75 TB of text data
By today's standards, that is not a lot
::: {style="text-align:center;"}
<!-- ![](./figures/bubble-model.png) -->
![](./figures/2024-Alan-D-Thompson-AI-Bubbles-Planets-Rev-6.png){width=80%}
:::
## {auto-animate=true background-video="./figures/tokyo-walk.mp4"}
<!-- TODO: add title -->
## Dawning of the World Model Era {auto-animate=true .smaller background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}
How much data does Sora use?
:::: columns
::: {.column width="50%"}
> We take inspiration from large language models which acquire generalist capabilities by training on **internet-scale** data [^1]
:::
::: {.column width="50%"}
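> For comparison: roughly 500 hours of video are uploaded to YouTube every minute, so the last five years of YouTube add up to about 1.3 billion hours, or 78.8 billion minutes. Because diffusion-based text-to-video training needs high-quality captioned video, we can estimate that Sora was trained on the order of 100 million minutes of video.
>
> A fairly accurate current estimate is that one minute of video is about 1M tokens. [^2]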
:::
::::
::: {.notes}
- 1 word ~= 1.3 tokens
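- Following Diffusion Transformer, a 256x256 image is split into 32x32 patches. Suppose a 1920x1080 HD frame is downsampled to 512x256 and each patch is an 8x8 pixel block: that gives a 64x32 patch grid, so one frame is about 64x32=2048 patches. HD video runs at 30 frames per second or more, but training and inference also compress along time; we estimate about 9 frames per second after compression, i.e. 540 frames per minute. One minute of video is therefore 64x32x540≈1.1M tokens.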
昆仑万维
:::
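::: {.notes}
The back-of-the-envelope token count for one minute of video can be reproduced as follows. All defaults (512x256 frames, 8x8 patches, 9 frames per second after compression) are the estimates quoted in the notes, not measured values, and the function name is illustrative.

```python
def video_tokens_per_minute(width=512, height=256, patch=8, fps=9, seconds=60):
    """Estimate tokens per minute, counting one token per image patch."""
    patches_per_frame = (width // patch) * (height // patch)  # 64 * 32
    frames = fps * seconds  # frames left after temporal compression
    return patches_per_frame * frames


tokens = video_tokens_per_minute()
print(f"{tokens:,} tokens per minute (about {tokens / 1e6:.1f}M)")
```
:::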
## Dawning of the World Model Era {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}
:::: columns
::: {.column width="50%"}
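> For comparison: roughly 500 hours of video are uploaded to YouTube every minute, so the last five years of YouTube add up to about 1.3 billion hours, or 78.8 billion minutes. Because diffusion-based text-to-video training needs high-quality captioned video, we can estimate that Sora was trained on the order of 100 million minutes of video.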
:::
::: {.column width="50%"}
::: {.r-fit-text}
~[500TB]{style="background-color:#e31c54; color:white;"} trained data
~[500PB]{style="background-color:#e31c54; color:white;"} raw data
:::
:::
::::
::: {.notes}
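At about 5 MB per minute of HD video, 100 million minutes comes to 500 TB, and filtering out those 100 million high-quality minutes may require 500 PB of raw data.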
:::
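::: {.notes}
The data-volume estimates on this slide follow from a few multiplications. This sketch simply replays them; every input is one of the rough figures quoted in the talk, and decimal units (1 TB = 10^6 MB, 1 PB = 1000 TB) are assumed.

```python
# YouTube upload volume over five years.
hours_uploaded_per_minute = 500
minutes_per_year = 60 * 24 * 365
youtube_hours = hours_uploaded_per_minute * minutes_per_year * 5
youtube_minutes = youtube_hours * 60

# Storage for the estimated Sora training set.
training_minutes = 100_000_000
mb_per_minute = 5
training_tb = training_minutes * mb_per_minute / 1e6

# Implied selectivity if 500 PB of raw video is filtered down to that set.
raw_pb = 500
filter_ratio = raw_pb * 1000 / training_tb

print(f"YouTube, 5 years: {youtube_hours / 1e9:.2f}B hours = {youtube_minutes / 1e9:.1f}B minutes")
print(f"Training set: {training_tb:.0f} TB, raw-to-kept ratio: {filter_ratio:.0f}x")
```
:::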
## Dawning of the World Model Era {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}
```{=html}
{{< include components/token-bar.qmd >}}
```
## Challenge {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}
:::: columns
::: {.column width="50%"}
::: {.r-fit-text}
Training on <br> [internet-]{.flow}<br> scale data
:::
:::
::: {.column width="45%"}
::: {.r-fit-text}
Modeling <br> [ultra-long]{.flow} <br> sequence
<!-- training on <br> [internet-scale]{.flow}<br> data? -->
:::
:::
::::
[^1]: [Video generation models as world simulators (Sora technical report)](https://openai.com/research/video-generation-models-as-world-simulators)
[^2]: [浅谈 Sora 未来的千倍推理算力需求](https://zhuanlan.zhihu.com/p/683636677)
# {.theme-section}
::: {.title}
<b>
<span>Streaming</span> Data Flow
</b>
:::
## Legacy training paradigm {auto-animate=true .smaller}
The traditional approach downloads the entire dataset to local storage up front and then processes it.
```{.python code-line-numbers="5-8|11-14"}
{{< include ./scripts/hf.py >}}
```
1. Download the dataset
2. Process it into model inputs and save them locally
3. Get ready to train
::: notes
- Download all the data to shared storage
- Convert all of it into the form the model accepts, in one pass
:::
## Legacy training paradigm {auto-animate=true}
:::: columns
::: {.column width="40%"}
![](./figures/etl-explain-large2.webp){width=90%}
:::
::: {.column width="60%" .fragment}
In this paradigm ETL and model training run strictly one after the other, which is simple and easy to reason about.
```{=html}
{{< include ./components/profile-old.qmd >}}
```
:::
::::
## What's the Problem? {auto-animate=true}
:::: columns
::: {.column width="60%"}
<p align="center">
![](./figures/etl-ai.jpg)
</p>
:::
::: {.column width="40%"}
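ETL for multimodal large models keeps getting more complex
- E: many modalities and heterogeneous sources, so pulling data takes a long time
- T: complex processing pipelines
- L: high storage footprint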
:::
::::
## What's the Problem? {auto-animate=true}
For copyright and storage reasons, multimodal data is mostly distributed as download links, so retrieval speed is limited
```{=html}
{{< include components/webvid.qmd >}}
```
::: {style="text-align:center; font-size: 0.4em;"}
WebVid is distributed as URLs: 10,730,233 entries in total
:::
<!-- ![](./figures/webvid.webp) -->
::: {.notes}
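- This means teams in mainland China must pay for expensive international bandwidth to fetch the data; for a small data center, downloading a dataset on the scale of Sora's training data could take years.
- Even for a mid-sized dataset like WebVid, download and processing time can become the training bottleneck.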
:::
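::: {.notes}
A rough way to see how download time scales with link speed. The 500 TB size is the training-set estimate from earlier in the talk; the bandwidth values are hypothetical sustained rates chosen only for illustration, and the function name is an assumption.

```python
def download_days(size_tb: float, bandwidth_gbps: float) -> float:
    """Days needed to move `size_tb` terabytes over a sustained link.

    Assumes decimal units and a perfectly saturated link (no retries or throttling).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (bandwidth_gbps * 1e9)
    return seconds / 86_400


for gbps in (0.1, 1.0, 10.0):  # hypothetical sustained bandwidths
    days = download_days(500, gbps)
    print(f"{gbps:>4} Gbps: {days:8.1f} days ({days / 365:.2f} years)")
```

Real transfers of link-distributed video are usually slower than this idealized bound, since per-URL rate limits and failed links add overhead.
:::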
## What's the Problem? {auto-animate=true}
Processing is complex and slow, and can cost even more than training itself
:::: columns
::: {.column width="60%"}
::: {style="margin-top: 50px;"}
![](./figures/caption.jpg)
:::
:::
::: {.column width="40%"}
<a style="display: flex; align-items: center;">
<img src="https://cdn3.emoji.gg/emojis/5892-chatgpt-logo-circle.png" width="44px" height="44px" alt="ChatGPT Logo" style="margin-right: 10px;">
<span style="color: white; font-weight: bold;">GPT-4V (20s/it)</span>
</a>
::: {style="font-size: 0.4em;"}
An aerial video sweeps over a majestic ocean cliff with striated rock formations in rich hues of red, green, and orange. The sun's rays enhance the colorful palette of the landscape, while the sea's azure waters gently crash against the cliff's base. Visible are the textured details of the cliff face and the contrast of the green algae and seagrass coating parts of the rock. Seabirds can be seen flying close to the water around the rocky outcrop. The video conveys a serene yet dynamic coastal scenery, highlighting the natural beauty and geological diversity of a rugged coastline.
:::
🌋 **LLaVA-1.6-Yi-34B (3s/it)**
::: {style="font-size: 0.4em;"}
A breathtaking aerial view of a rocky cliff jutting out into the ocean. The cliff, adorned with vibrant green moss and patches of yellow and red lichen, is bathed in the warm glow of the sun. The ocean, a deep blue, is dotted with small white waves that crash against the cliff's base. The perspective of the video is from above, providing a bird's eye view of the cliff and the surrounding ocean. The sun's rays illuminate the scene, casting a beautiful light on the cliff and the ocean. The video is a stunning representation of the raw beauty of nature.
:::
:::
::::
## What's the Problem? {auto-animate=true}
> [Storage]{.red} plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of [ image, video, and text data ]{.red}, the need for data storage grows rapidly. [^llama3]
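- Filtering 100 million minutes of video out of raw data may mean a raw corpus of tens of PB or more
- A typical small data center cannot build storage infrastructure suited to video pre-training.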
[^llama3]: [Building Meta’s GenAI Infrastructure](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/)
## What's the Problem? {auto-animate=true}
:::: columns
::: {.column width="50%"}
<p align="center">
![](./figures/etl-problem.webp)
</p>
:::
::: {.column width="50%"}
::: {.incremental}
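- Data sources are heterogeneous
- Sources cannot be pulled immediately
- Processing pipelines are complex
- Data processing is coupled to model training
- The data is too large to process in one pass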
- ...
:::
:::
::::
## What's the Problem? {auto-animate=true}
:::: columns
::: {.column width="50%"}
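- The data pipeline is drifting further and further from model training
- If data is still handled the traditional way,<br>the data pipeline becomes the bottleneck that stalls training.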
::: {style="margin-left: 54px;"}
![](./figures/etl-explain-small.webp){width=80%}
:::
:::
::: {.column width="50%"}
```{=html}
{{< include ./components/profile-naive.qmd >}}
```
:::
::::
## {auto-animate=true}
::: {.r-fit-text}
How to train <br> on [internet-scale]{.flow}<br> data?
:::
## {auto-animate=true}
:::: columns
::: {.column width="50%"}
::: {.r-fit-text}
How to train <br> on [internet-scale]{.flow}<br> data?
:::
:::
::: {.column width="45%"}
::: {.r-fit-text .fragment}
Just train on <br>[the internet]{.flow}!
<!-- training on <br> [internet-scale]{.flow}<br> data? -->
:::
:::
::::
## Streaming to the rescue {auto-animate=true}
:::: columns
::: {.column width="50%"}
![](./figures/streaming.gif)
::: {.incremental}
- Streaming the data solves these problems
- But streaming transfer is only the start: we need a training framework built entirely on streaming data
:::
:::
::: {.column width="50%"}
::: {.fragment}
![](./figures/decoup-data.png)
:::
:::
::::
<!-- ## {auto-animate=true background="./figures/mosaicml-streaming-dataset-img-1.gif"} -->
## Streaming to the rescue {auto-animate=true}
<p align="center">
![](./figures/plat.png){width=80%}
</p>
## Streaming to the rescue {auto-animate=true .smaller}
:::: columns
::: {.column width="60%"}
<!-- <p align="center"> -->
<!-- ![](./figures/mosaicml-streaming-dataset-img-1.gif){width=80%} -->
<!-- </p> -->
<p align="center">
![](./figures/plat.png){width=80%}
</p>
:::
::: {.column width="40%"}
::: {.incremental}
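- [x] Zero startup overhead
- [x] Data-processing processes are fully separated from training processes
- [x] Intra-node communication via `SharedMemory`, inter-node communication via an in-memory database
- [x] The data-processing cluster topology is independent of the GPU topology and can be adjusted dynamically
- [x] The database is sunk to storage periodically, so the data stream can be replayed
- [x] Deterministic sharding and shuffling algorithms keep replays consistent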
:::
:::
::::
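::: {.notes}
The talk does not show how samples are handed over inside a node. As a rough illustration only, Python's standard `multiprocessing.shared_memory` module can pass a processed sample by block name. This sketch runs both sides in one process for brevity, and the payload bytes are a made-up placeholder.

```python
from multiprocessing import shared_memory

payload = b"preprocessed-sample-0001"  # placeholder for a processed sample

# Producer side: allocate a named block and copy the sample into it.
producer = shared_memory.SharedMemory(create=True, size=len(payload))
producer.buf[: len(payload)] = payload

# Consumer side: attach by name; only the name has to cross the process boundary.
consumer = shared_memory.SharedMemory(name=producer.name)
# The block may be rounded up to a page size, so read only the payload length.
received = bytes(consumer.buf[: len(payload)])

consumer.close()
producer.close()
producer.unlink()  # free the block once both sides are done

print(received)
```

In a real pipeline the producer and consumer would be separate processes, and the block name and payload length would travel over a small control channel such as a queue.
:::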
## {auto-animate=true background="./figures/mosaicml-streaming-dataset-img-1.gif"}
::: {.notes}
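Samples within each cloud shard are partitioned and shuffled deterministically, which keeps replays consistent and independent of the training topology.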
:::
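::: {.notes}
One way to get a replayable, topology-independent sample order is to derive a global permutation from `(seed, epoch)` alone and let each rank take a strided slice of it. This is a minimal sketch of that idea, not the implementation used in the talk; all function names are illustrative.

```python
import random
from typing import List


def global_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """The same (seed, epoch) always yields the same permutation of sample ids."""
    rng = random.Random(f"{seed}-{epoch}")
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def samples_for_rank(order: List[int], rank: int, world_size: int, start: int = 0) -> List[int]:
    """Strided slice of the global order, resuming from global position `start`."""
    return order[start + rank :: world_size]


def interleave(per_rank: List[List[int]]) -> List[int]:
    """Rebuild the stream the whole job consumes, one step at a time."""
    merged = []
    for step in range(max(len(p) for p in per_rank)):
        for p in per_rank:
            if step < len(p):
                merged.append(p[step])
    return merged


order = global_order(10, seed=0, epoch=0)
on_2_gpus = interleave([samples_for_rank(order, r, 2) for r in range(2)])
on_4_gpus = interleave([samples_for_rank(order, r, 4) for r in range(4)])
resumed = interleave([samples_for_rank(order, r, 3, start=4) for r in range(3)])
print(on_2_gpus == on_4_gpus == order, resumed == order[4:])
```

Because the permutation never depends on the number of ranks, a job can be rewound to any global position and resumed on a different cluster size while still seeing the same sample sequence.
:::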
## Training on the internet {auto-animate=true .smaller background="./figures/mosaicml-streaming-dataset-img-1.gif" background-opacity=0.25}
With S3 as the storage backend for both data and weights, training can migrate seamlessly between clouds of different scales
```{=html}
{{< include components/cloud-switch.qmd >}}
```
## Training on the internet {auto-animate=true .smaller}
Adding a DPU cluster lets data flow directly to the GPUs, eliminating the overhead of the in-memory database
<p align="center">
![](./figures/dpu.png){width=100%}
</p>
<!-- [Bingyi Jing@ML-Summit]{style="text-align:center; font-size:4em;"} -->
<a style="display: flex; align-items: center; justify-content: center;">
<span style="color: white; font-weight: bold;">Powered by </span>
<img src="./figures/logo/ucloud.png" height="44px" alt="UCloud Logo" style="margin-right: 10px;">
</a>
::: {.notes}
In partnership with UCloud, a vendor-neutral cloud provider
:::
## Training on the internet {auto-animate=true .smaller}
<p align="center">
![](./figures/streaming-data.webp){width=80%}
</p>
## Training on the internet {auto-animate=true .smaller}
:::: columns
::: {.column width="50%"}
<p align="center">
![](./figures/streaming-data.webp){width=60%}
</p>
:::
::: {.column width="50%"}
- Further decouples data processing from model training
- Lets ETL run fully in parallel with model training
:::
::::
::: {.fragment}
```{=html}
{{< include components/profile-stream.qmd >}}
```
:::
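::: {.notes}
Running ETL in parallel with training is, at its core, a producer/consumer pipeline. The sketch below uses a thread and a bounded queue from the standard library as a stand-in; the doubling "transform" and the list-append "training step" are placeholders, and the names are illustrative rather than taken from the framework in the talk.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def etl_worker(raw_items, out_queue):
    """Producer: pull and transform items while training is already running."""
    for item in raw_items:
        out_queue.put(item * 2)  # stand-in for download + preprocessing
    out_queue.put(SENTINEL)


def train(in_queue):
    """Consumer: take batches as soon as they are ready."""
    seen = []
    while True:
        batch = in_queue.get()
        if batch is SENTINEL:
            break
        seen.append(batch)  # stand-in for one training step
    return seen


# Bounded buffer: ETL blocks when training falls behind, so memory stays flat.
buffer = queue.Queue(maxsize=4)
producer = threading.Thread(target=etl_worker, args=(range(8), buffer))
producer.start()
consumed = train(buffer)
producer.join()
print(consumed)
```

Training starts as soon as the first item arrives instead of waiting for the whole dataset, which is the difference from the serial download-then-train flow.
:::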
# {.theme-section}
::: {.title}
<b>
Scaling <span>Exact</span> Attention
</b>
:::
## Efficient distributed training infra {auto-animate="true"}
| | Flash-Attn-2 | FP8 (H100) | 3D Parallel + Zero | Padding Free | Fused Kernel | Static Graph | TGS[^l] |
|------------:|:------------:|:----------:|:------------------:|:------------:|:------------:|:------------:|:---:|
| Platformers | ✔️ | ✔️ | ✔️ | ✔️ | [100%]{style="color:red;"} | ✔️ | [3743]{style="color:red;"} |
| Megatron-LM | ✖️ | ✔️ | ✔️ | ✖️ | 80% | ✖️ | 3581 |
| Deepspeed | ✔️ | ✖️ | ✔️ | ✖️ | 60% | ✖️ |✖️ |
| Colossal-ai | ✖️ | ✖️ | ✔️ | ✖️ | 40% | ✖️ | 2610 |
<!-- | GPT-Neox | ✖️ | ✖️ | ✔️ | ✖️ | 10% | ✖️ | -->
[^l]: Training LLaMA2 7B on a DGX (8×A100 40GB) with a sequence length of 4096
## Scaling exact attention to ultra long sequence {auto-animate="true"}
<!-- ![](./figures/context_parallel.svg) -->
![](./figures/context_parallel.svg)
## Scaling exact attention to ultra long sequence {auto-animate="true"}
<p align="center">
![](./figures/computation_reduce.svg){width=80%}
</p>
## Scaling exact attention to ultra long sequence {auto-animate="true"}
:::: columns
::: {.column width="50%"}
<!-- ![](./figures/long_sequence_speed.png) -->
```{=html}
{{< include ./components/seq-time.qmd >}}
```
:::
::: {.column width="50%"}
<!-- ## ![](./figures/tflops_comparing.png) -->
```{=html}
{{< include ./components/seq-tflops.qmd >}}
```
:::
::::
<!-- ## Scaling exact attention to ultra long sequence {auto-animate="true"} -->
<!---->
<!-- ```{=html} -->
<!-- {{< include lwm.qmd >}} -->
<!-- ``` -->
<!---->
## Scaling exact attention to ultra long sequence {auto-animate="true"}
```{=html}
{{< include mocha.qmd >}}
```
# {.theme-end}
::: columns
::: {.column width="50%"}
::: {.r-fit-text}
Thanks
:::
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![wechat](./figures/qr/code.png)
:::
<!-- ## {{< qrcode https://u.wechat.com/MAmdMGMYjGFC4-2ESxZ1oyw width=200 height=200 >}} -->
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![e-mail](./figures/qr/mail-data.png)
:::
<!-- ## {{< qrcode https://u.wechat.com/MAmdMGMYjGFC4-2ESxZ1oyw width=200 height=200 >}} -->
:::
:::