Elie Bakouch committed
Commit b782b4a · Parent(s): 985435d

fix typos

- dist/index.html (+3 -3)
- src/index.html (+3 -3)
dist/index.html
CHANGED
@@ -434,7 +434,7 @@
\end{aligned}
</d-math>

-<p>Now let’s have a look at how things change if we use a lower precision. For stability […]
+<p>Now let’s have a look at how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>) we often don't use full low-precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed-precision training is to use BF16 for most of the computations – requiring 2 bytes per parameter and gradient – as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradients, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance, usually stored in FP32 for numerical stability, each using 4 bytes.</p>

<aside>See some more details below when we cover the ZeRO methods.</aside>

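To make the byte counting in the mixed-precision paragraph above concrete, here is a minimal Python sketch of the same arithmetic; the 7B parameter count at the end is an illustrative assumption, not a figure from this commit.

# Bytes per parameter for BF16 mixed-precision training with Adam,
# following the counts in the paragraph above.
def training_memory_gb(num_params: int) -> float:
    bf16_params   = 2  # BF16 weights used for the forward/backward compute
    bf16_grads    = 2  # BF16 gradients
    fp32_params   = 4  # FP32 master copy of the weights
    fp32_grads    = 4  # FP32 copy of the gradients  -> 12 bytes/param so far
    fp32_momentum = 4  # Adam first moment (FP32)
    fp32_variance = 4  # Adam second moment (FP32)   -> 20 bytes/param in total
    bytes_per_param = (bf16_params + bf16_grads + fp32_params
                       + fp32_grads + fp32_momentum + fp32_variance)
    return num_params * bytes_per_param / 1e9  # decimal GB

# A hypothetical 7B-parameter model: 7e9 * 20 bytes ≈ 140 GB,
# before counting any activation memory.
print(f"{training_memory_gb(7_000_000_000):.0f} GB")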
@@ -539,7 +539,7 @@

<h3>Activation recomputation</h3>

-<p>The general idea behind <strong><em>activation recomputation</em></strong> – also called <em>gradient checkpointing</em> or <em>rematerialization</em> – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute them on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. feed-forward, layernorm, etc.) so that we can use them during the backward pass to compute gradients. When we use recomputation, we typically only store activations at a few key points along the model architecture, discard the rest and recompute them on the fly during the backward pass from the nearest saved activations, essentially performing a sub-part of the forward pass again to trade […]
+<p>The general idea behind <strong><em>activation recomputation</em></strong> – also called <em>gradient checkpointing</em> or <em>rematerialization</em> – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute them on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g. feed-forward, layernorm, etc.) so that we can use them during the backward pass to compute gradients. When we use recomputation, we typically only store activations at a few key points along the model architecture, discard the rest and recompute them on the fly during the backward pass from the nearest saved activations, essentially performing a sub-part of the forward pass again to trade off memory for compute. It generally looks like this:</p>

<div class="svg-container" id="svg-activation_recomputation"> </div>
<div class="info" id="svg-activation_recomputation-info">Hover over the network elements to see their details</div>
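The recomputation paragraph above maps naturally onto PyTorch's built-in gradient checkpointing. Below is a minimal, hedged sketch (the toy block and sizes are made up for illustration, not taken from the article): only each block's input is kept, and the hidden states inside the block are recomputed during the backward pass.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class TinyBlock(nn.Module):
    """Stand-in for a Transformer sub-block (illustrative only)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.ff(self.norm(x))

blocks = nn.ModuleList(TinyBlock(64) for _ in range(4))
x = torch.randn(8, 64, requires_grad=True)

h = x
for block in blocks:
    # Only the block input is saved; activations inside the block are
    # discarded and recomputed on the fly during backward().
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()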
@@ -548,7 +548,7 @@

<ul>
<li><strong>Full</strong>: We checkpoint activations at the transition point between each layer of the Transformer model. This is usually called the <code>full</code> strategy since it requires a forward pass through each layer, essentially adding a full forward pass during the backward pass. This strategy saves the most memory but is the most expensive one in terms of compute: it generally increases the compute cost and time by up to 30-40%, which is very noticeable.</li>
-<li><strong>Selective</strong>: In general we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of FLOPs. It turns out that the attention computations fall into that category, and thus we can usually discard them and focus on checkpointing expensive […]
+<li><strong>Selective</strong>: In general we can do better than full. The authors of the recomputation paper<d-cite bibtex-key="korthikanti2022recomputation"></d-cite> did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of FLOPs. It turns out that the attention computations fall into that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model, this means a <strong>70% activation memory reduction at a 2.7% compute cost</strong>.</li>
</ul>

<aside>In recent models like DeepSeek-V3, selective checkpointing is performed, storing an even smaller amount of attention activations – using so-called “Multi-Head Latent Attention” (MLA) – to optimize activation memory usage.</aside>
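For the "Selective" strategy described in the list above, here is a hedged sketch of the idea (module names and shapes are hypothetical, and this is not the implementation from the cited paper): only the attention computation is wrapped in a checkpoint, so its large but cheap-to-recompute activations are discarded, while the feed-forward activations stay in memory.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class SelectiveBlock(nn.Module):
    """Toy block: recompute attention, store feed-forward activations."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        def attn_fn(h):
            # Attention intermediates are recomputed during backward().
            out, _ = self.attn(h, h, h, need_weights=False)
            return out
        x = x + checkpoint(attn_fn, self.norm1(x), use_reentrant=False)
        # Feed-forward activations are kept in memory as usual.
        return x + self.ff(self.norm2(x))

block = SelectiveBlock(64)
y = block(torch.randn(2, 16, 64, requires_grad=True))
y.sum().backward()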
src/index.html
CHANGED
(Same three hunks as shown above for dist/index.html.)