nouamanetazi committed
Commit 1d7bb53 · 1 Parent(s): 6edf7e0
Files changed (2)
  1. dist/index.html +4 -2
  2. src/index.html +4 -2
dist/index.html CHANGED
@@ -1077,7 +1077,7 @@
     <tbody>
       <tr>
         <td>Embedding Layer (Row Linear sharded on vocab)</td>
-        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
         <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
       </tr>
     </tbody>
@@ -1436,12 +1436,14 @@
     <h2>Expert parallelism</h2>
     <p>One more <s>thing</s> parallelism.</p>
 
+    <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
     <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
 
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
 
     <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
 
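Note on the first hunk: the corrected cell describes the forward pass of a vocab-sharded (row-linear) embedding. With plain tensor parallelism the partial per-shard embeddings are summed with an all-reduce, so both h and s stay full; in the sequence-parallel column a reduce-scatter is used instead, leaving the output sharded along the sequence dimension. Below is a minimal PyTorch sketch of that behaviour; the class, argument names, and shapes are illustrative and not code from the post, and it assumes an initialized tensor-parallel process group, a vocabulary size divisible by the group size, and a sequence length divisible by it for the SP path.

```python
# Hypothetical sketch, not the post's implementation: a vocab-sharded
# ("row linear") embedding layer. Assumes torch.distributed is initialized,
# `tp_group` is the tensor-parallel group, the vocabulary divides evenly
# across its ranks, and the sequence length divides evenly for the SP path.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class VocabShardedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, tp_group=None):
        super().__init__()
        self.tp_group = tp_group
        rank = dist.get_rank(tp_group)
        tp_size = dist.get_world_size(tp_group)
        shard = vocab_size // tp_size
        self.vocab_start, self.vocab_end = rank * shard, (rank + 1) * shard
        self.weight = nn.Parameter(torch.empty(shard, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, token_ids, sequence_parallel: bool = False):
        # Each rank embeds only the tokens in its vocab slice; the rest
        # contribute zeros, so summing the partial results across ranks is exact.
        mask = (token_ids < self.vocab_start) | (token_ids >= self.vocab_end)
        local_ids = (token_ids - self.vocab_start).clamp(0, self.weight.size(0) - 1)
        out = F.embedding(local_ids, self.weight)           # (b, s, h), partial result
        out[mask] = 0.0

        if not sequence_parallel:
            # TP column of the table: all-reduce the partial embeddings,
            # so both h and s stay full on every rank.
            dist.all_reduce(out, group=self.tp_group)
            return out

        # SP column: reduce-scatter along the sequence dimension instead,
        # so the layer's output is already sharded as (b, s/tp, h).
        b, s, h = out.shape
        out = out.transpose(0, 1).contiguous()              # (s, b, h): scatter over dim 0
        shard_out = out.new_empty(s // dist.get_world_size(self.tp_group), b, h)
        dist.reduce_scatter_tensor(shard_out, out, group=self.tp_group)
        return shard_out.transpose(0, 1).contiguous()
```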
src/index.html CHANGED
@@ -1077,7 +1077,7 @@
     <tbody>
       <tr>
         <td>Embedding Layer (Row Linear sharded on vocab)</td>
-        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: unchanged</td>
+        <td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
         <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
       </tr>
     </tbody>
@@ -1436,12 +1436,14 @@
     <h2>Expert parallelism</h2>
     <p>One more <s>thing</s> parallelism.</p>
 
+    <p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
     <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>
 
     <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
     <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>
 
-    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert’s feedforward layer on a different worker. Compared to TP it’s much more lightweight, since we don’t need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
+    <p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>
 
     <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>
 
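Note on the second hunk: the added paragraph describes expert parallelism, where each expert's feedforward block lives on a different worker and token hidden states are routed to it. A rough, hypothetical sketch of top-1 routing with one expert per rank, using all-to-all dispatch and combine, follows; it is forward-only, assumes an NCCL-backed process group, and omits capacity limits, load-balancing losses, gate-probability scaling, and the DeepSeek-V3 node constraint. Every name here is illustrative rather than taken from the post.

```python
# Hypothetical sketch, not the post's implementation: expert parallelism with
# top-1 routing and exactly one expert hosted per rank of `ep_group`.
# Forward-only: the collectives below are not autograd-aware as written.
import torch
import torch.nn as nn
import torch.distributed as dist

class ExpertParallelMoE(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, ep_group=None):
        super().__init__()
        self.ep_group = ep_group
        self.num_experts = dist.get_world_size(ep_group)    # one expert per rank
        self.router = nn.Linear(hidden_size, self.num_experts)
        self.expert = nn.Sequential(                         # this rank's local expert
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )

    def forward(self, x):                                    # x: (num_tokens, hidden)
        expert_idx = self.router(x).argmax(dim=-1)           # top-1 expert per token
        order = torch.argsort(expert_idx)                    # group tokens by target rank
        x_sorted = x[order]
        send_counts = torch.bincount(expert_idx, minlength=self.num_experts)

        # Tell every rank how many tokens it will receive from each peer.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts, group=self.ep_group)
        send_split, recv_split = send_counts.tolist(), recv_counts.tolist()

        # Dispatch: each rank receives the hidden states routed to its expert.
        recv_buf = x_sorted.new_empty(sum(recv_split), x.size(-1))
        dist.all_to_all_single(recv_buf, x_sorted, recv_split, send_split,
                               group=self.ep_group)

        out_local = self.expert(recv_buf)                    # run the local feedforward

        # Combine: send each token's output back to the rank it came from.
        combined = x_sorted.new_empty(x_sorted.shape)
        dist.all_to_all_single(combined, out_local, send_split, recv_split,
                               group=self.ep_group)
        return combined[torch.argsort(order)]                # restore original token order
```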