nouamanetazi (HF staff) committed on
Commit 45ff7d5 · 1 Parent(s): 1355d8e
Files changed (1)
  1. src/index.html +5 -1
src/index.html CHANGED
@@ -1252,9 +1252,13 @@
 
  <p>Sequence and context parallelism can help for long sequences, but they don’t help much if sequence length is not the root cause of our memory issues but rather the size of the model itself. For large models (70B+), the size of the weights alone can already push past the limits of the 4-8 GPUs on a single node. We can solve this issue by summoning the fourth (and last) parallelism dimension: “pipeline parallelism”.</p>
 
+ <p>Pipeline parallelism is a simple but powerful technique: we split our model's layers across multiple GPUs! For example, if we have 8 GPUs, we could put layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. This way, each GPU only needs to store and process a portion of the model's layers, significantly reducing the memory requirements per GPU. Let's take the example of an 8B model:</p>
+
  <p><img alt="pp_memoryusage.svg" src="/assets/images/pp_memoryusage.svg" /></p>
 
- <p>Pipeline Parallelism is conceptually very simple: we’ll simply spread the layers of our model across GPUs, but the devil lies in implementing it efficiently. Let’s dive into it!</p>
+ <p>Looking at the figure above, we notice something interesting: while the parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers. The activations from one GPU's layers need to be sent to the next GPU to continue the forward pass.</p>
+
+ <p>This introduces a new type of communication pattern: instead of communicating parameters as in data parallelism with ZeRO, we're now passing activation tensors sequentially between GPUs in a "pipeline". While conceptually simple, implementing this efficiently is quite tricky. Let's dive into the details!</p>
 
  <h3>Splitting layers on various nodes - All forward, all backward</h3>
 
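To make the added paragraphs concrete, here is a minimal sketch of the pattern they describe: each rank builds only its own slice of layers, and activations are handed from stage to stage with point-to-point send/recv. It uses standard `torch.distributed` calls, but the constants (`NUM_LAYERS`, `HIDDEN`, `MICRO_BATCH`), the helper `layers_for_rank`, and the suggested file name `pp_sketch.py` are illustrative assumptions, not code from the article.

```python
# Minimal sketch of naive pipeline parallelism (assumed names, not the article's code).
# Each rank holds only a slice of the layers; activations flow rank 0 -> 1 -> ... -> last.
import torch
import torch.distributed as dist
from torch import nn

NUM_LAYERS = 8     # toy model: 8 identical MLP blocks (assumed divisible by world size)
HIDDEN = 1024
MICRO_BATCH = 4

def layers_for_rank(rank: int, world_size: int) -> range:
    """Evenly split layer indices across pipeline stages, e.g. 8 layers on 2 ranks -> 0-3 and 4-7."""
    per_rank = NUM_LAYERS // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)

def main():
    # "gloo" keeps the sketch CPU-runnable; use "nccl" on actual GPUs.
    dist.init_process_group(backend="gloo")
    rank, world = dist.get_rank(), dist.get_world_size()

    # Only this rank's slice of layers is materialized -> parameter memory is split across ranks.
    local_layers = nn.ModuleList(
        nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
        for _ in layers_for_rank(rank, world)
    )

    # Forward pass of one micro-batch through the pipeline.
    if rank == 0:
        x = torch.randn(MICRO_BATCH, HIDDEN)   # first stage owns the input batch
    else:
        x = torch.empty(MICRO_BATCH, HIDDEN)   # later stages receive activations...
        dist.recv(x, src=rank - 1)             # ...from the previous stage

    for layer in local_layers:                 # run only the local slice of layers
        x = layer(x)

    if rank < world - 1:
        dist.send(x, dst=rank + 1)             # hand activations to the next stage
    else:
        print(f"[rank {rank}] pipeline output shape: {tuple(x.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Saved as e.g. `pp_sketch.py`, this can be launched with `torchrun --nproc_per_node=2 pp_sketch.py`. In training, gradients of these activations would flow back through the same links in the reverse direction; scheduling those forward and backward passes efficiently is exactly what the "All forward, all backward" section below addresses.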