Commit 45ff7d5 · fix pp
Parent(s): 1355d8e

Files changed: src/index.html (+5 -1)
@@ -1252,9 +1252,13 @@

 <p>Sequence and context parallelism can help for long sequences but don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>

+<p>Pipeline parallelism is a simple but powerful technique - we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of an 8B model:</p>
+
 <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p>

-<p>
+<p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
+
+<p>This introduces a new type of communication pattern: instead of communicating parameters like in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>

 <h3>Splitting layers on various nodes - All forward, all backward</h3>

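To make the layer-splitting idea in the first added paragraph concrete, here is a minimal, naive sketch (not the article's implementation) of a pipeline that places contiguous chunks of layers on different GPUs and hops activations from one device to the next during the forward pass. The class name NaivePipeline and the two-GPU setup are illustrative assumptions.

import torch
import torch.nn as nn

class NaivePipeline(nn.Module):
    """Naive pipeline: contiguous chunks of layers live on different devices."""

    def __init__(self, layers: list[nn.Module], devices: list[torch.device]):
        super().__init__()
        # Split the layer list into len(devices) contiguous stages,
        # e.g. with 8 layers and 2 GPUs: layers 1-4 on GPU 0, layers 5-8 on GPU 1.
        chunk = (len(layers) + len(devices) - 1) // len(devices)
        self.stages = nn.ModuleList(
            nn.Sequential(*layers[i * chunk:(i + 1) * chunk]).to(dev)
            for i, dev in enumerate(devices)
        )
        self.devices = devices

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each GPU only holds its own stage's weights, but the full batch of
        # activations flows through every stage, hopping from device to device.
        for stage, dev in zip(self.stages, self.devices):
            x = stage(x.to(dev))
        return x

# Usage: an 8-layer toy model split across two GPUs (assumes 2 CUDA devices).
layers = [nn.Linear(1024, 1024) for _ in range(8)]
model = NaivePipeline(layers, [torch.device("cuda:0"), torch.device("cuda:1")])
out = model(torch.randn(4, 1024, device="cuda:0"))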
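A quick back-of-the-envelope check of the 8B example and the "activation memory remains the same" observation, assuming bf16 weights (2 bytes per parameter) split evenly across pipeline stages; the numbers are illustrative and not taken from the figure.

# Illustrative arithmetic (not from the article) for the 8B example:
# parameter memory shrinks with the pipeline-parallel degree, while activation
# memory per GPU does not, because every GPU still sees the full batch.

N_PARAMS = 8e9        # assumed 8B-parameter model
BYTES_PER_PARAM = 2   # assumed bf16 weights

def weights_per_gpu_gb(pp_degree: int) -> float:
    # Layers (and hence weights) are split evenly across pipeline stages.
    return N_PARAMS * BYTES_PER_PARAM / pp_degree / 1e9

for pp in (1, 2, 4, 8):
    print(f"pp={pp}: ~{weights_per_gpu_gb(pp):.0f} GB of weights per GPU")
# pp=1: ~16 GB, pp=2: ~8 GB, pp=4: ~4 GB, pp=8: ~2 GB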
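The "passing activation tensors sequentially between GPUs" pattern from the last added paragraph is typically realized with point-to-point sends and receives between pipeline ranks. The following is a hedged sketch using torch.distributed send/recv; it assumes an already-initialized process group in which each rank owns one pipeline stage, and the function name run_forward_stage is hypothetical.

import torch
import torch.distributed as dist

def run_forward_stage(stage: torch.nn.Module, act_shape, device):
    """One pipeline stage: receive activations, compute, send to the next rank."""
    rank, world_size = dist.get_rank(), dist.get_world_size()

    if rank == 0:
        # The first stage owns the input batch (random data here for illustration).
        x = torch.randn(*act_shape, device=device)
    else:
        # Every other stage waits for the previous stage's activations.
        x = torch.empty(*act_shape, device=device)
        dist.recv(x, src=rank - 1)

    y = stage(x)

    if rank < world_size - 1:
        # Hand the activations over to the next stage in the pipeline.
        dist.send(y, dst=rank + 1)
    return y

The backward pass needs the mirror-image hand-off for gradients, which is where the scheduling questions discussed in the next section ("All forward, all backward") come in.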