nouamanetazi committed
Commit 1d7bb53 · 1 Parent(s): 6edf7e0
Files changed (2)
  1. dist/index.html +4 -2
  2. src/index.html +4 -2
dist/index.html CHANGED
@@ -1077,7 +1077,7 @@
     <tbody>
       <tr>
         <td>Embedding Layer (Row Linear sharded on vocab)</td>
-        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
         <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
       </tr>
     </tbody>
@@ -1436,12 +1436,14 @@
     <h2>Expert parallelism</h2>
     <p>One more <s>thing</s> parallelism.</p>
 
+    <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
     <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
 
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
 
     <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
 
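Note on the first hunk: the corrected cell describes the forward pass of a vocab-sharded (row-linear) embedding. With plain tensor parallelism the partial per-shard embeddings are summed with an all-reduce, so both h and s stay full; in the sequence-parallel column a reduce-scatter is used instead, leaving the output sharded along the sequence dimension. Below is a minimal PyTorch sketch of that behaviour; the class, argument names, and shapes are illustrative and not code from the post, and it assumes an initialized tensor-parallel process group, a vocabulary size divisible by the group size, and a sequence length divisible by it for the SP path.

```python
# Hypothetical sketch, not the post's implementation: a vocab-sharded
# ("row linear") embedding layer. Assumes torch.distributed is initialized,
# `tp_group` is the tensor-parallel group, the vocabulary divides evenly
# across its ranks, and the sequence length divides evenly for the SP path.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class VocabShardedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        rank = dist.get_rank(tp_group)
        tp_size = dist.get_world_size(tp_group)
        shard = vocab_size // tp_size
        self.vocab_start, self.vocab_end = rank * shard, (rank + 1) * shard
        self.weight = nn.Parameter(torch.empty(shard, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, token_ids, sequence_parallel: bool = False):
        # Each rank embeds only the tokens in its vocab slice; the rest
        # contribute zeros, so summing the partial results across ranks is exact.
        mask = (token_ids < self.vocab_start) | (token_ids >= self.vocab_end)
        local_ids = (token_ids - self.vocab_start).clamp(0, self.weight.size(0) - 1)
        out = F.embedding(local_ids, self.weight)           # (b, s, h), partial result
        out[mask] = 0.0

        if not sequence_parallel:
            # TP column of the table: all-reduce the partial embeddings,
            # so both h and s stay full on every rank.
            dist.all_reduce(out, group=self.tp_group)
            return out

        # SP column: reduce-scatter along the sequence dimension instead,
        # so the layer's output is already sharded as (b, s/tp, h).
        b, s, h = out.shape
        out = out.transpose(0, 1).contiguous()              # (s, b, h): scatter over dim 0
        shard_out = out.new_empty(s // dist.get_world_size(self.tp_group), b, h)
        dist.reduce_scatter_tensor(shard_out, out, group=self.tp_group)
        return shard_out.transpose(0, 1).contiguous()
```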
src/index.html CHANGED
@@ -1077,7 +1077,7 @@
     <tbody>
       <tr>
         <td>Embedding Layer (Row Linear sharded on vocab)</td>
-        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
         <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
       </tr>
     </tbody>
@@ -1436,12 +1436,14 @@
     <h2>Expert parallelism</h2>
     <p>One more <s>thing</s> parallelism.</p>
 
+    <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
     <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
 
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
 
     <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
 
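Note on the second hunk: the added paragraph describes expert parallelism, where each expert's feedforward block lives on a different worker and token hidden states are routed to it. A rough, hypothetical sketch of top-1 routing with one expert per rank, using all-to-all dispatch and combine, follows; it is forward-only, assumes an NCCL-backed process group, and omits capacity limits, load-balancing losses, gate-probability scaling, and the DeepSeek-V3 node constraint. Every name here is illustrative rather than taken from the post.

```python
# Hypothetical sketch, not the post's implementation: expert parallelism with
# top-1 routing and exactly one expert hosted per rank of `ep_group`.
# Forward-only: the collectives below are not autograd-aware as written.
import torch
import torch.nn as nn
import torch.distributed as dist

class ExpertParallelMoE(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, ep_group=None):
        super().__init__()
        self.ep_group = ep_group
        self.num_experts = dist.get_world_size(ep_group)    # one expert per rank
        self.router = nn.Linear(hidden_size, self.num_experts)
        self.expert = nn.Sequential(                         # this rank's local expert
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )

    def forward(self, x):                                    # x: (num_tokens, hidden)
        expert_idx = self.router(x).argmax(dim=-1)           # top-1 expert per token
        order = torch.argsort(expert_idx)                    # group tokens by target rank
        x_sorted = x[order]
        send_counts = torch.bincount(expert_idx, minlength=self.num_experts)

        # Tell every rank how many tokens it will receive from each peer.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts, group=self.ep_group)
        send_split, recv_split = send_counts.tolist(), recv_counts.tolist()

        # Dispatch: each rank receives the hidden states routed to its expert.
        recv_buf = x_sorted.new_empty(sum(recv_split), x.size(-1))
        dist.all_to_all_single(recv_buf, x_sorted, recv_split, send_split,
                               group=self.ep_group)

        out_local = self.expert(recv_buf)                    # run the local feedforward

        # Combine: send each token's output back to the rank it came from.
        combined = x_sorted.new_empty(x_sorted.shape)
        dist.all_to_all_single(combined, out_local, send_split, recv_split,
                               group=self.ep_group)
        return combined[torch.argsort(order)]                # restore original token order
```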