<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>

<h2>TL;DR</h2>

<h2>First Steps: Training on one GPU</h2>

<h3>Memory usage in Transformers</h3>

<h4>Memory profiling a training step</h4>

<h4>Weights/grads/optimizer states memory</h4>

<h4>Activations memory</h4>

<h3>Activation recomputation</h3>

<h3>Gradient accumulation</h3>

<h2>Data Parallelism</h2>

<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>

<h4><strong>Second optimization:</strong> Bucketing gradients</h4>

<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>

<h3>Revisit global batch size</h3>

<h3>Our journey up to now</h3>

<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>

<h4>Memory usage revisited</h4>

<h4>ZeRO-1: Partitioning Optimizer States</h4>

<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>

<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>

<h2>Tensor Parallelism</h2>

<h3>Tensor Parallelism in a Transformer Block</h3>

<h3>Sequence Parallelism</h3>

<h2>Context Parallelism</h2>

<h3>Introducing Context Parallelism</h3>

<h3>Discovering Ring Attention</h3>

<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>

<h2>Pipeline Parallelism</h2>

<h3>Splitting layers on various nodes - All forward, all backward</h3>

<h3>One-forward-one-backward and Llama 3.1 schemes</h3>

<h3>Interleaving stages</h3>

<h3>Zero Bubble and DualPipe</h3>

<h2>Expert parallelism</h2>

<h2>5D parallelism in a nutshell</h2>

<h2>How to Find the Best Training Configuration</h2>

<h2>Diving into the GPUs – fusing, threading, mixing</h2>

<h4>A primer on GPUs</h4>

<h3>How to improve performance with Kernels?</h3>

<h4>Memory Coalescing</h4>

<h4>Tiling</h4>

<h4>Thread Coarsening</h4>

<h4>Minimizing Control Divergence</h4>

<h3>Flash Attention 1-3</h3>

<h3>Fused Kernels</h3>

<h3>Mixed Precision Training</h3>

<h4>FP16 and BF16 training</h4>

<h4>FP8 pretraining</h4>

<h2>Conclusion</h2>

<h3>What you learned</h3>

<h3>What we learned</h3>

<h3>What’s next?</h3>

<h2>References</h2>

<h3>Landmark LLM Scaling Papers</h3>

<h3>Training Frameworks</h3>

<h3>Debugging</h3>

<h3>Distribution Techniques</h3>

<h3>CUDA Kernels</h3>

<h3>Hardware</h3>

<h3>Others</h3>

<h2>Appendix</h2>

<h3>A0: Parallel Programming Crash Course</h3>

<h4>Broadcast</h4>

<h4>Reduce &amp; AllReduce</h4>

<h4>A quick focus on Ring All-Reduce</h4>

<h4>Gather &amp; AllGather</h4>

<h4>Scatter &amp; ReduceScatter</h4>

<h4>Barrier</h4>

<h4>NCCL: NVIDIA Collective Communications Library</h4>

<h3>A1: Profiling</h3>

<h4>Kernels</h4>

<h3>A2: TP Backward pass</h3>

<h3>A3: ZeRO-R</h3>

<h4>$P_a$: Partitioned Activation Checkpointing</h4>

<h4>$C_B$: Constant Size Buffers</h4>

<h4>$M_D$: Memory Defragmentation</h4>

<h4>Communication Analysis of ZeRO-R</h4>

<h3>A5: Memory profile</h3>

<h3>TP: Practical PyTorch Implementation</h3>

<h4>Gelu code</h4>

<h4>Interconnect</h4>

<h3>How to profile your code</h3>

<h3>Formulas for the compute / communication balance</h3>

<h3>Integrating Context Parallelism with TP/SP</h3>

<h3>The nanotron FP8 recipe</h3>

<h3>Overlapping computation and communication</h3>