luismirandacruz committed
Commit 3aebabc · 2 parents: 701c95d b782b4a

merge with pr 97

Files changed (2)
  1. dist/index.html +3 -3
  2. src/index.html +1 -1
dist/index.html CHANGED
@@ -434,7 +434,7 @@
   \end{aligned}
   </d-math>
 
-  <p>Now let’s have look how things change if we use a lower precision. For stability reason (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>) we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
+  <p>Now let’s have look how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>) we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
 
   <aside>See some more details below when we cover the ZeRO methods.</aside>
 
@@ -539,7 +539,7 @@
 
   <h3>Activation recomputation</h3>
 
-  <p>The general idea behind <strong><em>activation recomputation</em></strong> – also called <em>gradient checkpointing</em> or <em>rematerialization</em> – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute these on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. feed-forward, layernorm etc.), such that we can use them during the backward pass to compute gradients. When we use recomputation we typically will only store activations at a few key points along the model architecture, discard the rest of activations and recompute them on the fly during the backward pass from the nearest saved activations, basically performing again a sub-part of the forward pass to trade of memory for compute. It generally looks like this:</p>
+  <p>The general idea behind <strong><em>activation recomputation</em></strong> – also called <em>gradient checkpointing</em> or <em>rematerialization</em> – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute these on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. feed-forward, layernorm etc.), such that we can use them during the backward pass to compute gradients. When we use recomputation we typically will only store activations at a few key points along the model architecture, discard the rest of activations and recompute them on the fly during the backward pass from the nearest saved activations, basically performing again a sub-part of the forward pass to trade off memory for compute. It generally looks like this:</p>
 
   <div class="svg-container" id="svg-activation_recomputation"> </div>
   <div class="info" id="svg-activation_recomputation-info">Hover over the network elements to see their details</div>
@@ -548,7 +548,7 @@
 
   <ul>
   <li><strong>Full</strong>: We checkpoint activations at the transition point between each layer of the Transformer model. This is usually called the <code>full</code> strategy since it requires a forward pass through each layer essentially adding a full forward pass during the backward pass. This strategy saves the most memory but is the most expensive one in terms of compute. It generally increases the compute cost and time by up to 30-40% which is very noticeable.</li>
-  <li><strong>Selective</strong>: In general we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of FLOPs. Turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing expensive the feedforward computations. For a GPT-3 (175B) model this means <strong>70% activation memory reduction at a 2.7% compute cost</strong>.</li>
+  <li><strong>Selective</strong>: In general we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of FLOPs. Turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model this means <strong>70% activation memory reduction at a 2.7% compute cost</strong>.</li>
   </ul>
 
   <aside>In recent models like DeepSeek V3, selective checkpointing is performed, storing even a smaller size of attention activation —using so-called “Multi-Head Latent Attention” (MLA)– to optimize activation memory usage.</aside>
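
The recomputation paragraph touched in the hunk above describes the idea only in prose. For context, here is a minimal PyTorch sketch of the same mechanism, assuming torch.utils.checkpoint; the Block class, hidden size and tensor shapes are invented for illustration and are not part of the article or of this commit.

# Illustrative sketch only (not from the article or this commit): one common way to get
# activation recomputation in PyTorch is torch.utils.checkpoint. The Block class, hidden
# size and tensor shapes below are invented for the example.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A toy feed-forward block standing in for one transformer layer."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Without recomputation: every intermediate inside self.ff is kept for backward.
        return x + self.ff(self.norm(x))


class CheckpointedBlock(Block):
    """Same block, but only the block input is stored; the intermediates inside
    self.ff are discarded after the forward pass and recomputed during backward."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + checkpoint(lambda t: self.ff(self.norm(t)), x, use_reentrant=False)


if __name__ == "__main__":
    x = torch.randn(2, 16, 1024, requires_grad=True)
    CheckpointedBlock()(x).sum().backward()  # the backward pass re-runs norm + ff

Wrapping a whole layer this way corresponds to the "full" strategy listed above; selective variants apply the same wrapper only to the parts that are cheap to recompute, such as the attention score computations, while keeping the expensive feedforward activations stored.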
src/index.html CHANGED
@@ -434,7 +434,7 @@
   \end{aligned}
   </d-math>
 
-  <p>Now let’s have look how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>), we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
+  <p>Now, let’s have a look at how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>), we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
 
   <aside>See some more details below when we cover the ZeRO methods.</aside>
 
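
The paragraph changed in both files above counts bytes per parameter for BF16 mixed-precision training with Adam. Below is a small back-of-the-envelope script that reproduces that accounting; it is pure arithmetic, and the 70B-parameter figure at the end is an illustrative assumption, not a number from the article or this commit.

# Back-of-the-envelope accounting for the byte counts stated in the changed paragraph:
# BF16 mixed-precision training with Adam. Pure arithmetic; the 70B-parameter example
# at the end is an assumption for illustration, not a figure from the article.
BYTES_BF16 = 2
BYTES_FP32 = 4

per_param = {
    "bf16_params": BYTES_BF16,    # working weights used in forward/backward
    "bf16_grads": BYTES_BF16,     # gradients accumulated in BF16
    "fp32_params": BYTES_FP32,    # FP32 master copy of the weights
    "fp32_grads": BYTES_FP32,     # FP32 copy of the gradients
    "adam_momentum": BYTES_FP32,  # first moment, kept in FP32 for numerical stability
    "adam_variance": BYTES_FP32,  # second moment, kept in FP32 for numerical stability
}

weights_and_grads = sum(per_param[k] for k in ("bf16_params", "bf16_grads", "fp32_params", "fp32_grads"))
optimizer_states = per_param["adam_momentum"] + per_param["adam_variance"]

print(f"weights + gradients: {weights_and_grads} bytes per parameter")   # 12, as in the text
print(f"Adam optimizer states: {optimizer_states} bytes per parameter")  # 4 + 4 = 8

n_params = 70e9  # hypothetical 70B-parameter model, just to put the numbers in GB
total_gb = (weights_and_grads + optimizer_states) * n_params / 1e9
print(f"total before activations: {total_gb:.0f} GB")  # 20 bytes/param * 70e9 = 1400 GB

This only mirrors the per-parameter arithmetic in the paragraph; real totals also include activations, buffers and framework overhead.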