more thom #57
opened by thomwolf
dist/assets/images/torch-compile-triton-kernel.png
CHANGED
dist/assets/images/torch-compile-triton.png
CHANGED
dist/index.html
CHANGED
@@ -1660,8 +1660,9 @@
 
 <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
 
-
-
+<div class="large-image-background">
+<img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
+</div>
 
 
 <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
@@ -1672,16 +1673,17 @@
 
 <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
 
-<
-
-
+<div class="large-image-background">
+<img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
+</div>
 
 
 <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the `all-to-all` operations routing tokens to their assigned experts and gathering the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scales to a large number of experts.</p>
 <aside>For instance DeepSeek V3 uses 256 experts.</aside>
 
-<
-
+<div class="large-image-background">
+<img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
+</div>
 <div class="note-box">
 <p class="note-box-title">📝 Note</p>
 <div class="note-box-content">
@@ -1733,11 +1735,15 @@
 <p><strong>Summarizing it all–</strong> Now what about gathering and combining all the techniques we've seen in a single diagram combining them all. Yes, we're up for the challenge!</p>
 <p>In this summary diagram, you will find illustrated activations and modules for a single transformers layer –in it's MoE variant–. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
 
-<
+<div class="large-image-background">
+<p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
+</div>
 
 <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence length as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
 
-<
+<div class="large-image-background">
+<img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
+</div>
 
 <p>Let's finish this section with a high level view at all of these techniques, their main underlying idea and major bottleneck:</p>
 
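As a side note on the tensor-parallelism paragraph touched by the hunk above: the "distributive property of matrix multiplications" it mentions can be checked in a few lines. The following is a minimal single-process sketch (not part of this PR; the shard count and tensor sizes are made up for illustration) showing that a column-sharded matmul, with partial results concatenated afterwards, reproduces the full result.

import torch

torch.manual_seed(0)
tp_size = 4                          # pretend number of tensor-parallel ranks
X = torch.randn(8, 512)              # activations, replicated on every rank
W = torch.randn(512, 1024)           # full weight of a column-parallel linear

# Each "rank" holds only a 512 x 256 column slice of W.
shards = W.chunk(tp_size, dim=1)

# Every rank computes its partial output independently, with no communication...
partials = [X @ w_shard for w_shard in shards]

# ...and concatenating along the hidden dimension (an all-gather in real TP)
# reassembles the full output.
Y_tp = torch.cat(partials, dim=1)

# The sharded computation matches the unsharded matmul up to float round-off.
assert torch.allclose(Y_tp, X @ W, rtol=1e-4, atol=1e-4)

In actual tensor parallelism each shard lives on a different GPU, so the final concatenation is a collective; here everything stays on one device purely to show the algebra.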
src/index.html
CHANGED
@@ -1660,8 +1660,9 @@
 
 <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
 
-
-
+<div class="large-image-background">
+<img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
+</div>
 
 
 <p>The main reason we don't want to use TP only for parallelism is that, in practice, TP has two limitations we've discussed in the previous sections: First, since its communication operations are part of the critical path of computation, it's difficult to scale well beyond a certain point at which communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more cumbersome to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
@@ -1672,16 +1673,17 @@
 
 <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. As we saw in <a target="_self" href="#context_parallelism"> CP section</a>, this is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where, even when using full activation recomputation, the memory requirements for attention would be prohibitive on a single GPU.</p>
 
-<
-
-
+<div class="large-image-background">
+<img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" style="width: 1200px; max-width: none;" />
+</div>
 
 
 <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication operation in EP is the `all-to-all` operations routing tokens to their assigned experts and gathering the results back. While this operation introduces some communication overhead, it enables scaling model capacity significantly since each token is only processed during inference (and training) by a much smaller fraction of the total parameters. In terms of distributed training/inference, partitioning experts across GPUs becomes relevant when models scales to a large number of experts.</p>
 <aside>For instance DeepSeek V3 uses 256 experts.</aside>
 
-<
-
+<div class="large-image-background">
+<img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" style="width: 1200px; max-width: none;" />
+</div>
 <div class="note-box">
 <p class="note-box-title">📝 Note</p>
 <div class="note-box-content">
@@ -1733,11 +1735,15 @@
 <p><strong>Summarizing it all–</strong> Now what about gathering and combining all the techniques we've seen in a single diagram combining them all. Yes, we're up for the challenge!</p>
 <p>In this summary diagram, you will find illustrated activations and modules for a single transformers layer –in it's MoE variant–. We also illustrate the various directions of parallelism and the communication operations we've been discussing in all the previous sections.</p>
 
-<
+<div class="large-image-background">
+<p><img alt="image.png" src="/assets/images/5d_full.svg" style="width: 1200px; max-width: none;"/></p>
+</div>
 
 <p>We can also represent side-by-side a <strong>full overview</strong> of the memory savings for each one of these strategies. We'll plot them with different sequence length as well as with selective (top) and full (bottom) recomputation so you can see how they all play with activations:</p>
 
-<
+<div class="large-image-background">
+<img alt="5Dparallelism_8Bmemoryusage.svg" src="/assets/images/5Dparallelism_8Bmemoryusage.svg" style="width: 1200px; max-width: none;"/>
+</div>
 
 <p>Let's finish this section with a high level view at all of these techniques, their main underlying idea and major bottleneck:</p>
 
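Likewise, the expert-parallelism paragraph in the hunks above describes dispatching tokens to experts via all-to-all and gathering the results back. Below is a rough single-process sketch of that dispatch/combine pattern; the top-1 router, the four experts, and the dimensions are illustrative assumptions rather than the post's actual MoE configuration.

import torch
import torch.nn as nn

torch.manual_seed(0)
num_experts, d_model, num_tokens = 4, 64, 32

router = nn.Linear(d_model, num_experts)          # token -> expert scores
experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_experts)])

tokens = torch.randn(num_tokens, d_model)
expert_ids = router(tokens).argmax(dim=-1)        # top-1 routing decision per token

output = torch.empty_like(tokens)
for e, expert in enumerate(experts):
    mask = expert_ids == e
    if mask.any():
        # With expert parallelism, gathering `tokens[mask]` onto the GPU that
        # owns expert `e` is the first all-to-all ("dispatch"), and writing the
        # results back into their original slots is the second ("combine").
        output[mask] = expert(tokens[mask])

On a single device this is just grouped indexing; across GPUs the two indexing steps become the pair of all-to-all collectives the paragraph refers to.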