<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>

<h2>TL;DR</h2>

<h2>First Steps: Training on one GPU</h2>

<h3>Memory usage in Transformers</h3>

<h4>Memory profiling a training step</h4>

<h4>Weights/grads/optimizer states memory</h4>

<h4>Activations memory</h4>

<h3>Activation recomputation</h3>

<h3>Gradient accumulation</h3>

<h2>Data Parallelism</h2>

<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>

<h4><strong>Second optimization:</strong> Bucketing gradients</h4>

<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>

<h3>Revisit global batch size</h3>

<h3>Our journey up to now</h3>

<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>

<h4>Memory usage revisited</h4>

<h4>ZeRO-1: Partitioning Optimizer States</h4>

<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>

<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>

<h2>Tensor Parallelism</h2>

<h3>Tensor Parallelism in a Transformer Block</h3>

<h3>Sequence Parallelism</h3>

<h2>Context Parallelism</h2>

<h3>Introducing Context Parallelism</h3>

<h3>Discovering Ring Attention</h3>

<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>

<h2>Pipeline Parallelism</h2>

<h3>Splitting layers on various nodes - All forward, all backward</h3>

<h3>One-forward-one-backward and Llama 3.1 schemes</h3>

<h3>Interleaving stages</h3>

<h3>Zero Bubble and DualPipe</h3>

<h2>Expert parallelism</h2>

<h2>5D parallelism in a nutshell</h2>

<h2>How to Find the Best Training Configuration</h2>

<h2>Diving into the GPUs – fusing, threading, mixing</h2>

<h4>A primer on GPUs</h4>

<h3>How to improve performance with Kernels?</h3>

<h4>Memory Coalescing</h4>

<h4>Tiling</h4>

<h4>Thread Coarsening</h4>

<h4>Minimizing Control Divergence</h4>

<h3>Flash Attention 1-3</h3>

<h3>Fused Kernels</h3>

<h3>Mixed Precision Training</h3>

<h4>FP16 and BF16 training</h4>

<h4>FP8 pretraining</h4>

<h2>Conclusion</h2>

<h3>What you learned</h3>

<h3>What we learned</h3>

<h3>What’s next?</h3>

<h2>References</h2>

<h3>Landmark LLM Scaling Papers</h3>

<h3>Training Frameworks</h3>

<h3>Debugging</h3>

<h3>Distribution Techniques</h3>

<h3>CUDA Kernels</h3>

<h3>Hardware</h3>

<h3>Others</h3>

<h2>Appendix</h2>

<h3>A0: Parallel Programming Crash Course</h3>

<h4>Broadcast</h4>

<h4>Reduce &amp; AllReduce</h4>

<h4>A quick focus on Ring All-Reduce</h4>

<h4>Gather &amp; AllGather</h4>

<h4>Scatter &amp; ReduceScatter</h4>

<h4>Barrier</h4>

<h4>NCCL: NVIDIA Collective Communications Library</h4>

<h3>A1: Profiling</h3>

<h4>Kernels</h4>

<h3>A2: TP Backward pass</h3>

<h3>A3: ZeRO-R</h3>

<h4>$P_a$: Partitioned Activation Checkpointing</h4>

<h4>$C_B$: Constant Size Buffers</h4>

<h4>$M_D$: Memory Defragmentation</h4>

<h4>Communication Analysis of ZeRO-R</h4>

<h3>A5: Memory profile</h3>

<h3>TP: Practical PyTorch Implementation</h3>

<h4>Gelu code</h4>

<h4>Interconnect</h4>

<h3>How to profile your code</h3>

<h3>Formulas for the compute / communication balance</h3>

<h3>Integrating Context Parallelism with TP/SP</h3>

<h3>The nanotron FP8 recipe</h3>

<h3>Overlapping computation and communication</h3>