<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
<h2>TL;DR</h2>
<h2>First Steps: Training on one GPU</h2>
<h3>Memory usage in Transformers</h3>
<h4>Memory profiling a training step</h4>
<h4>Weights/grads/optimizer states memory</h4>
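To make the weights/gradients/optimizer-states accounting concrete, here is a minimal sketch (an illustration under the standard mixed-precision Adam layout, not code from the playbook): bf16 weights and gradients plus fp32 master weights and Adam moments, i.e. roughly 16 bytes per parameter.

<pre><code># Rough memory accounting for mixed-precision training with Adam
# (assumed standard layout: bf16 weights/grads, fp32 master weights + moments).
def training_memory_bytes(num_params: int) -> dict:
    weights = 2 * num_params          # bf16 parameters
    grads = 2 * num_params            # bf16 gradients
    optim = (4 + 4 + 4) * num_params  # fp32 master weights + Adam m and v
    return {"weights": weights, "grads": grads, "optim": optim,
            "total": weights + grads + optim}

# A 7B-parameter model already needs ~112 GB before counting activations.
print(training_memory_bytes(7_000_000_000)["total"] / 1e9)
</code></pre>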
<h4>Activations memory</h4>
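For activations, the commonly cited per-layer estimate from Korthikanti et al. (https://arxiv.org/abs/2205.05198) is $s \cdot b \cdot h \cdot (34 + 5as/h)$ bytes in mixed precision, with sequence length $s$, micro-batch size $b$, hidden size $h$ and $a$ attention heads. A small sketch (the shapes below are illustrative):

<pre><code>def activation_memory_bytes(s, b, h, a, num_layers):
    # Per-layer activations of a standard transformer block in mixed precision,
    # following Korthikanti et al.; the 5*a*s/h term is the attention matrix,
    # which disappears with flash attention / selective recomputation.
    return num_layers * s * b * h * (34 + 5 * a * s / h)

# Illustrative 7B-ish shapes: ~104 GB of activations for a single sequence.
print(activation_memory_bytes(s=4096, b=1, h=4096, a=32, num_layers=32) / 1e9)
</code></pre>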
<h3>Activation recomputation</h3>
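In PyTorch, activation recomputation is exposed as <code>torch.utils.checkpoint</code>; a minimal sketch with a toy block (not the playbook's actual training code):

<pre><code>import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                            torch.nn.GELU(),
                            torch.nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed during backward,
# trading extra compute for a smaller activation memory footprint.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
</code></pre>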
<h3>Gradient accumulation</h3>
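Gradient accumulation in a nutshell: run several micro-batch forward/backward passes, letting gradients sum in <code>.grad</code>, and step the optimizer once per global batch. A minimal sketch with a stand-in model and random micro-batches:

<pre><code>import torch

model = torch.nn.Linear(512, 512)            # stand-in model
opt = torch.optim.AdamW(model.parameters())
grad_accum_steps = 4

opt.zero_grad()
for step, micro_batch in enumerate(torch.randn(16, 8, 512).unbind(0)):
    loss = model(micro_batch).pow(2).mean() / grad_accum_steps  # scale to average
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % grad_accum_steps == 0:
        opt.step()
        opt.zero_grad()
</code></pre>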
<h2>Data Parallelism</h2>
<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
<h4><strong>Second optimization:</strong> Bucketing gradients</h4>
<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>
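These three optimizations meet in DDP's <code>no_sync()</code> context manager: the gradient all-reduce (bucketed and overlapped with the backward pass) is skipped on all but the last micro-batch of an accumulation cycle. A sketch assuming an already-initialized process group and a DDP-wrapped model:

<pre><code>import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulation_cycle(model: DDP, micro_batches, optimizer, loss_fn):
    optimizer.zero_grad()
    # Skip the bucketed gradient all-reduce for all but the last micro-batch...
    for mb in micro_batches[:-1]:
        with model.no_sync():
            loss_fn(model(mb)).backward()
    # ...then let DDP overlap communication with this final backward pass.
    loss_fn(model(micro_batches[-1])).backward()
    optimizer.step()
</code></pre>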
<h3>Revisiting global batch size</h3>
<h3>Our journey up to now</h3>
<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
<h4>Memory usage revisited</h4>
<h4>ZeRO-1: Partitioning Optimizer States</h4>
<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
<h4>ZeRO-3: Adding <strong>Parameter Partitioning</strong></h4>
<h2>Tensor Parallelism</h2>
<h3>Tensor Parallelism in a Transformer Block</h3>
<h3>Sequence Parallelism</h3>
<h2>Context Parallelism</h2>
<h3>Introducing Context Parallelism</h3>
<h3>Discovering Ring Attention</h3>
<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
<h2>Pipeline Parallelism</h2>
<h3>Splitting layers on various nodes – all forward, all backward</h3>
<h3>One-forward-one-backward and Llama 3.1 schemes</h3>
<h3>Interleaving stages</h3>
<h3>Zero Bubble and DualPipe</h3>
<h2>Expert Parallelism</h2>
<h2>5D parallelism in a nutshell</h2>
<h2>How to Find the Best Training Configuration</h2>
<h2>Diving into the GPUs – fusing, threading, mixing</h2>
<h4>A primer on GPUs</h4>
<h3>How to improve performance with kernels?</h3>
<h4>Memory Coalescing</h4>
<h4>Tiling</h4>
<h4>Thread Coarsening</h4>
<h4>Minimizing Control Divergence</h4>
<h3>Flash Attention 1-3</h3>
<h3>Fused Kernels</h3>
<h3>Mixed Precision Training</h3>
<h4>FP16 and BF16 training</h4>
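A minimal mixed-precision step with PyTorch autocast (an illustrative sketch, not the playbook's trainer): fp16 needs dynamic loss scaling to avoid gradient underflow, while bf16's wider exponent range usually lets you skip the scaler.

<pre><code>import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()  # needed for fp16; bf16 typically skips it

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).pow(2).mean()     # forward runs in fp16 where safe
scaler.scale(loss).backward()         # scale the loss against fp16 underflow
scaler.step(opt)
scaler.update()
</code></pre>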
<h4>FP8 pretraining</h4>
<h2>Conclusion</h2>
<h3>What you learned</h3>
<h3>What we learned</h3>
<h3>What’s next?</h3>
<h2>References</h2>
<h3>Landmark LLM Scaling Papers</h3>
<h3>Training Frameworks</h3>
<h3>Debugging</h3>
<h3>Distribution Techniques</h3>
<h3>CUDA Kernels</h3>
<h3>Hardware</h3>
<h3>Others</h3>
<h2>Appendix</h2>
<h3>A0: Parallel Programming Crash Course</h3>
<h4>Broadcast</h4>
<h4>Reduce & AllReduce</h4>
<h4>A quick focus on Ring All-Reduce</h4>
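A toy, single-process simulation of ring all-reduce (a pedagogical sketch, nothing like NCCL's pipelined GPU implementation): each rank's buffer is split into one chunk per rank, a reduce-scatter phase circulates and sums chunks in $N-1$ steps, then an all-gather phase circulates the reduced chunks in another $N-1$ steps, which is why each rank sends $\frac{2(N-1)}{N}$ of the buffer in total.

<pre><code># Toy single-process simulation of ring all-reduce on Python lists
# (one chunk per rank; `ranks[r]` is rank r's local buffer).
def ring_all_reduce(ranks):
    n = len(ranks)
    # Reduce-scatter: at step t, rank r sends chunk (r - t) % n to rank r+1,
    # which accumulates it; rank r ends up owning the full sum of chunk (r+1) % n.
    for t in range(n - 1):
        for r in range(n):
            c = (r - t) % n
            ranks[(r + 1) % n][c] += ranks[r][c]
    # All-gather: circulate the fully reduced chunks, overwriting stale ones.
    for t in range(n - 1):
        for r in range(n):
            c = (r + 1 - t) % n
            ranks[(r + 1) % n][c] = ranks[r][c]
    return ranks

# 4 ranks with identical data; every rank ends up with the sum [4, 8, 12, 16].
print(ring_all_reduce([[1, 2, 3, 4] for _ in range(4)]))
</code></pre>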
<h4>Gather & AllGather</h4>
<h4>Scatter & ReduceScatter</h4>
<h4>Barrier</h4>
<h4>NCCL: NVIDIA Collective Communications Library</h4>
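A minimal script exercising these collectives through PyTorch's NCCL bindings (illustrative; launch with e.g. <code>torchrun --nproc_per_node=2 demo.py</code> on a multi-GPU node):

<pre><code>import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # NCCL backend for GPU tensors
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.full((4,), float(rank), device="cuda")
dist.broadcast(t, src=0)                  # all ranks now hold rank 0's tensor
t += rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # elementwise sum across ranks
dist.barrier()                            # synchronize before printing
print(f"rank {rank}: {t.tolist()}")
dist.destroy_process_group()
</code></pre>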
<h3>A1: Profiling</h3>
<h4>Kernels</h4>
<pre><code># Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
</code></pre>
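The <code>prof</code> object above comes from a <code>torch.profiler</code> context; a minimal sketch with an illustrative stand-in workload:

<pre><code>import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()   # illustrative workload
x = torch.randn(64, 1024, device="cuda")

# Record CPU and CUDA activity for one forward/backward pass; `prof` then
# feeds the key_averages() table shown above.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()
</code></pre>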
For custom kernels, the CUDA C++ extension side starts from the usual headers:

<pre><code>#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
</code></pre>

and is compiled and invoked from Python; a minimal sketch (file and symbol names are illustrative, the kernel body itself is not shown):

<pre><code>import torch
from torch.utils.cpp_extension import load

# Load and compile the CUDA extension
module = load(name="my_kernel", sources=["my_kernel.cpp", "my_kernel.cu"])
# Define input tensors
x = torch.randn(1024, device="cuda")
# Run the CUDA kernel
y = module.forward(x)
</code></pre>
<h3>A2: TP Backward pass</h3>
<h3>A3: ZeRO-R</h3>
<h4>$P_a$: Partitioned Activation Checkpointing</h4>
<h4>$C_B$: Constant Size Buffers</h4>
<h4>$M_D$: Memory Defragmentation</h4>
<h4>Communication Analysis of ZeRO-R</h4>
<h3>A5: Memory profile</h3>
<pre><code># Set up optimizer
optimizer = torch.optim.AdamW(model.parameters())  # `model` from the surrounding snippet
</code></pre>
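One way to produce such a memory profile (a sketch using PyTorch's semi-private allocator-history hooks; model shapes are illustrative) is to record allocator events around a training step and dump a snapshot for https://pytorch.org/memory_viz:

<pre><code>import torch

model = torch.nn.Linear(4096, 4096).cuda()          # illustrative model
optimizer = torch.optim.AdamW(model.parameters())

# Record allocator events, run one training step, then dump a snapshot
# that can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(max_entries=100_000)
x = torch.randn(64, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
</code></pre>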
<h3>TP: Practical PyTorch Implementation</h3>
<pre><code># This is the `f` function in the paper: https://arxiv.org/abs/1909.08053
# core logic of Column Parallel linear
</code></pre>
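A minimal sketch of what these two comments describe (class and method names are illustrative, and an initialized <code>torch.distributed</code> process group is assumed): <code>f</code> is the identity in the forward pass and an all-reduce in the backward pass, and the column-parallel linear layer applies <code>f</code> before multiplying by its local shard of the weight matrix.

<pre><code>import torch
import torch.distributed as dist

class _Copy(torch.autograd.Function):
    """The `f` function: identity forward, all-reduce backward."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM)
        return grad_output

class ColumnParallelLinear(torch.nn.Module):
    """Each rank holds a column shard of the weight; outputs stay sharded."""
    def __init__(self, in_features, out_features):
        super().__init__()
        shard = out_features // dist.get_world_size()
        self.weight = torch.nn.Parameter(torch.randn(shard, in_features))

    def forward(self, x):
        x = _Copy.apply(x)  # f: no-op now, all-reduce of input grads in backward
        return torch.nn.functional.linear(x, self.weight)
</code></pre>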
<h4>GeLU code</h4>
<h4>Interconnect</h4>
<h3>How to profile your code</h3>
<h3>Formulas for compute / comms balance</h3>
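As a first-order sketch of such a balance formula (standard estimates, stated here as assumptions rather than quoted from the playbook): a ring all-reduce of $M_{\text{grad}}$ bytes over $N$ ranks moves $\frac{2(N-1)}{N} M_{\text{grad}}$ bytes per rank, and the backward pass over a model with $P$ parameters and $T$ tokens per rank costs roughly $4PT$ FLOPs, so gradient synchronization fully hides behind the backward pass approximately when

$$\frac{2(N-1)}{N} \cdot \frac{M_{\text{grad}}}{BW} \;\leq\; \frac{4\,P\,T}{F_{\text{peak}}}$$

with $BW$ the per-rank interconnect bandwidth and $F_{\text{peak}}$ the sustained FLOP/s per GPU.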
<h3>Integrating Context Parallelism with TP/SP</h3>
<h3>The nanotron FP8 recipe</h3>
<h3>Overlapping computation and communication</h3>