nouamanetazi (HF staff) committed on
Commit 45ff7d5 · 1 Parent(s): 1355d8e
Files changed (1)
  1. src/index.html +5 -1
src/index.html CHANGED
@@ -1252,9 +1252,13 @@
 
  <p>Sequence and context parallelism can help for long sequences, but they don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
 
+ <p>Pipeline parallelism is a simple but powerful technique: we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of an 8B model:</p>
+
  <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p>
 
- <p>Pipeline Parallelism is conceptually very simple: we’ll simply spread the layers of our model across GPUs, but the devil lies in implementing it efficiently. Let’s dive into it!</p>
+ <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
+
+ <p>This introduces a new type of communication pattern: instead of communicating parameters as in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
 
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
 
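To make the added paragraphs concrete, here is a minimal sketch of the pattern they describe: each rank builds only its own slice of layers, and activations are handed from stage to stage with point-to-point send/recv. It uses standard `torch.distributed` calls, but the constants (`NUM_LAYERS`, `HIDDEN`, `MICRO_BATCH`), the helper `layers_for_rank`, and the suggested file name `pp_sketch.py` are illustrative assumptions, not code from the article.

```python
# Minimal sketch of naive pipeline parallelism (assumed names, not the article's code).
# Each rank holds only a slice of the layers; activations flow rank 0 -> 1 -> ... -> last.
import torch
import torch.distributed as dist
from torch import nn

NUM_LAYERS = 8     # toy model: 8 identical MLP blocks (assumed divisible by world size)
HIDDEN = 1024
MICRO_BATCH = 4

def layers_for_rank(rank: int, world_size: int) -> range:
    """Evenly split layer indices across pipeline stages, e.g. 8 layers on 2 ranks -> 0-3 and 4-7."""
    per_rank = NUM_LAYERS // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)

def main():
    # "gloo" keeps the sketch CPU-runnable; use "nccl" on actual GPUs.
    dist.init_process_group(backend="gloo")
    rank, world = dist.get_rank(), dist.get_world_size()

    # Only this rank's slice of layers is materialized -> parameter memory is split across ranks.
    local_layers = nn.ModuleList(
        nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
        for _ in layers_for_rank(rank, world)
    )

    # Forward pass of one micro-batch through the pipeline.
    if rank == 0:
        x = torch.randn(MICRO_BATCH, HIDDEN)   # first stage owns the input batch
    else:
        x = torch.empty(MICRO_BATCH, HIDDEN)   # later stages receive activations...
        dist.recv(x, src=rank - 1)             # ...from the previous stage

    for layer in local_layers:                 # run only the local slice of layers
        x = layer(x)

    if rank < world - 1:
        dist.send(x, dst=rank + 1)             # hand activations to the next stage
    else:
        print(f"[rank {rank}] pipeline output shape: {tuple(x.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Saved as e.g. `pp_sketch.py`, this can be launched with `torchrun --nproc_per_node=2 pp_sketch.py`. In training, gradients of these activations would flow back through the same links in the reverse direction; scheduling those forward and backward passes efficiently is exactly what the "All forward, all backward" section below addresses.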