Files changed (2)
  1. dist/index.html +7 -18
  2. src/index.html +7 -18
dist/index.html CHANGED
@@ -38,17 +38,13 @@
   </d-front-matter>
   <d-title>
   <h1 class="l-page" style="text-align: center;">The Ultra-Scale Playbook:<br>Training LLMs on GPU Clusters</h1>
- <div id="title-plot" class="main-plot-container l-screen">
+ <div id="title-plot" class="main-plot-container l-screen" style="overflow-x: hidden; width: 100%; text-align: center;">
   <iframe id="banner"
- src="assets/data/benchmarks/banner.html" scrolling="no" frameborder="0" loading="lazy" style="display: block; margin: 0 auto; position: relative;">
+ src="assets/data/benchmarks/banner.html" scrolling="no" frameborder="0" width="1200"
+ height="675" loading="lazy" style="margin: 0 auto; position: relative;">
   </iframe>
- <script>
- window.addEventListener('load', function() {
- const frame = document.getElementById('banner');
- frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
- frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
- });
- </script>
+ <p style="text-align: center; font-style: italic; margin-top: 10px; max-width: 900px; margin-left: auto; margin-right: auto;">We ran over 4000 scaling experiments on up to 512 GPUs and measured throughput (size of markers) and GPU utilization (color of markers). Note that both are normalized per model size in this visualization.</p>
+
   </div>
   </d-title>
   <d-byline></d-byline>
@@ -181,16 +177,9 @@
 
   <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p>
 
- <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
+ <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
 
- <iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" width="90%" scrolling="no" frameborder="0"></iframe>
- <script>
- window.addEventListener('load', function() {
- const frame = document.getElementById('plotFrame');
- frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
- frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
- });
- </script>
+ <iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" scrolling="no" frameborder="0" height="840" width="720"></iframe>
 
   <p>As you can see, there’s a lot of ground to be covered. Before getting into the trenches of distributed training let’s take a quick high level look on we’ll cover in the post.</p>
 
src/index.html CHANGED
@@ -38,17 +38,13 @@
   </d-front-matter>
   <d-title>
   <h1 class="l-page" style="text-align: center;">The Ultra-Scale Playbook:<br>Training LLMs on GPU Clusters</h1>
- <div id="title-plot" class="main-plot-container l-screen">
+ <div id="title-plot" class="main-plot-container l-screen" style="overflow-x: hidden; width: 100%; text-align: center;">
   <iframe id="banner"
- src="assets/data/benchmarks/banner.html" scrolling="no" frameborder="0" loading="lazy" style="display: block; margin: 0 auto; position: relative;">
+ src="assets/data/benchmarks/banner.html" scrolling="no" frameborder="0" width="1200"
+ height="675" loading="lazy" style="margin: 0 auto; position: relative;">
   </iframe>
- <script>
- window.addEventListener('load', function() {
- const frame = document.getElementById('banner');
- frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
- frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
- });
- </script>
+ <p style="text-align: center; font-style: italic; margin-top: 10px; max-width: 900px; margin-left: auto; margin-right: auto;">We ran over 4000 scaling experiments on up to 512 GPUs and measured throughput (size of markers) and GPU utilization (color of markers). Note that both are normalized per model size in this visualization.</p>
+
   </div>
   </d-title>
   <d-byline></d-byline>
@@ -181,16 +177,9 @@
 
   <p><img alt="Picotron implements each key concept in a self-contained way, such that the method can be studied separately and in isolation." src="assets/images/placeholder.png" /></p>
 
- <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
+ <p><strong>Real training efficiency benchmarks:</strong> Finally, how to <em>actually</em> scale your LLM training depends on your infrastructure, such as the kind of chips, interconnect etc., and we can’t give a single unified recipe. What we will give though is a way to benchmark several setups and it is what we have done on our cluster! We ran over 4100 distributed experiments (over 16k including test runs) with up to 512 GPUs to scan many possible distributed training layouts and model sizes. TODO: link to dataset too </p>
 
- <iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" width="90%" scrolling="no" frameborder="0"></iframe>
- <script>
- window.addEventListener('load', function() {
- const frame = document.getElementById('plotFrame');
- frame.style.height = frame.contentWindow.document.documentElement.scrollHeight + 'px';
- frame.style.width = frame.contentWindow.document.documentElement.scrollWidth + 'px';
- });
- </script>
+ <iframe id="plotFrame" src="assets/data/benchmarks/benchmarks_interactive.html" scrolling="no" frameborder="0" height="840" width="720"></iframe>
 
   <p>As you can see, there’s a lot of ground to be covered. Before getting into the trenches of distributed training let’s take a quick high level look on we’ll cover in the post.</p>
 