minor changes
- dist/assets/images/memorycoalescing.png +2 -2
- dist/index.html +4 -2
- src/index.html +4 -2
dist/assets/images/memorycoalescing.png
CHANGED
(Binary image stored with Git LFS; before/after previews not reproduced here.)
dist/index.html
CHANGED
@@ -82,6 +82,8 @@
 
 <p>As the size of the clusters used to train these models grew, various techniques such as data parallelism, tensor parallelism, pipeline parallelism and context parallelism, as well as ZeRO and kernel fusion, have been invented to make sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the best use of this expensive hardware. Moreover, the challenge of scaling up AI training goes beyond just building the initial models: teams have found that fine-tuning large models on specialized data often produces the best results, generally involving the same distributed training techniques. In this book we'll progressively go over all of these techniques, from the simplest to the most refined, while keeping a single storyline to understand where each method comes from.</p>
 
+<aside>If you have questions or remarks, open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>
+
 <p>We'll assume you have some basic knowledge of current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, the basics of model training can be found in the great courses at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorial sections</a>. This book can be seen as the second part of a trilogy, following our first blog post on processing data for pre-training, the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”. Having read both blog posts, you should have almost all the core knowledge needed to fully understand how high-performing LLMs are built nowadays; you'll just be missing some final spices regarding data mixing and architecture choices to complete the recipe (stay tuned for part three…).</p>
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team for creating
@@ -90,7 +92,7 @@
 <p>The book is built on the following <strong>three general foundations</strong>:</p>
 
 <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You'll learn which parts of a language model eat up your memory and when during training this happens. You'll learn how we can solve memory constraints by parallelizing the models and increase throughput by scaling up to more GPUs. As a result, you'll understand how the following widget for computing the memory breakdown of a transformer model works:</p>
-<aside>Note that we're still missing Pipeline Parallelism in this widget. To be added as exercise for the reader.</aside>
+<aside>Note that we're still missing Pipeline Parallelism in this widget. To be added as an exercise for the reader.</aside>
 
 <div class="large-image-background-transparent">
 <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
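The memory-breakdown widget referenced in the hunk above essentially counts bytes per parameter for weights, gradients and optimizer states. The Python sketch below illustrates that kind of accounting under one common setup (bf16 weights and gradients plus fp32 master weights and two Adam moments, roughly 16 bytes per parameter); it is an illustrative approximation, not the widget's actual implementation, and it deliberately leaves out activation memory, which depends on batch size, sequence length and recomputation rather than parameter count alone.

# Back-of-the-envelope memory accounting for mixed-precision training with Adam.
# Illustrative sketch only -- not the exact formula used by the widget.

def training_memory_gb(num_params: float,
                       weight_bytes: int = 2,    # bf16 weights
                       grad_bytes: int = 2,      # bf16 gradients
                       optim_bytes: int = 12):   # fp32 master weights + 2 Adam moments
    """Return the static memory breakdown in GB (activations excluded)."""
    gb = 1024 ** 3
    weights = num_params * weight_bytes / gb
    grads = num_params * grad_bytes / gb
    optim = num_params * optim_bytes / gb
    return {"weights": weights, "gradients": grads,
            "optimizer_states": optim, "total": weights + grads + optim}

for size in (1e9, 8e9, 70e9):
    b = training_memory_gb(size)
    print(f"{size / 1e9:>4.0f}B params: {b['total']:.0f} GB total "
          f"({b['weights']:.0f} weights, {b['gradients']:.0f} grads, "
          f"{b['optimizer_states']:.0f} optimizer)")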
@@ -387,7 +389,7 @@
 
 <h4>Profiling the memory usage</h4>
 
-<p>Using
+<p>Using the PyTorch profiler, we can understand how memory is allocated throughout training. We can see that memory utilization is not static but varies a lot over the course of training and within a single training step:</p>
 
 <aside>Check out <a target="_self" href="#a1%3A_distributed_training_profiling" class="">A1: Distributed Training Profiling</a> for a walkthrough of how to profile your model.</aside>
 
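For readers who want to reproduce the kind of memory profile described in the paragraph added above, here is a minimal, self-contained sketch using torch.profiler with memory tracking enabled; the toy model, optimizer and step count are placeholders, not the setup used in the book.

import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# Minimal sketch: profile memory over a few training steps of a toy model.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    for _ in range(3):
        x = torch.randn(8, 1024, device=device)
        loss = model(x).square().mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Operators sorted by how much device memory they allocated themselves.
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))

In recent PyTorch versions, torch.cuda.memory._record_memory_history() together with torch.cuda.memory._dump_snapshot() records a full allocation timeline that can be visualized at pytorch.org/memory_viz, producing the kind of per-step memory plot the paragraph refers to.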
src/index.html
CHANGED
@@ -82,6 +82,8 @@
 
 <p>As the size of the clusters used to train these models grew, various techniques such as data parallelism, tensor parallelism, pipeline parallelism and context parallelism, as well as ZeRO and kernel fusion, have been invented to make sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the best use of this expensive hardware. Moreover, the challenge of scaling up AI training goes beyond just building the initial models: teams have found that fine-tuning large models on specialized data often produces the best results, generally involving the same distributed training techniques. In this book we'll progressively go over all of these techniques, from the simplest to the most refined, while keeping a single storyline to understand where each method comes from.</p>
 
+<aside>If you have questions or remarks, open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>
+
 <p>We'll assume you have some basic knowledge of current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, the basics of model training can be found in the great courses at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorial sections</a>. This book can be seen as the second part of a trilogy, following our first blog post on processing data for pre-training, the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”. Having read both blog posts, you should have almost all the core knowledge needed to fully understand how high-performing LLMs are built nowadays; you'll just be missing some final spices regarding data mixing and architecture choices to complete the recipe (stay tuned for part three…).</p>
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team for creating
@@ -90,7 +92,7 @@
 <p>The book is built on the following <strong>three general foundations</strong>:</p>
 
 <p><strong>Quick intros on theory and concepts:</strong> before diving into code and experiments, we want to understand how each method works at a high level and what its advantages and limits are. You'll learn which parts of a language model eat up your memory and when during training this happens. You'll learn how we can solve memory constraints by parallelizing the models and increase throughput by scaling up to more GPUs. As a result, you'll understand how the following widget for computing the memory breakdown of a transformer model works:</p>
-<aside>Note that we're still missing Pipeline Parallelism in this widget. To be added as exercise for the reader.</aside>
+<aside>Note that we're still missing Pipeline Parallelism in this widget. To be added as an exercise for the reader.</aside>
 
 <div class="large-image-background-transparent">
 <div style="display: grid; grid-template-columns: 1fr 1fr; align-items: center;">
@@ -387,7 +389,7 @@
 
 <h4>Profiling the memory usage</h4>
 
-<p>Using
+<p>Using the PyTorch profiler, we can understand how memory is allocated throughout training. We can see that memory utilization is not static but varies a lot over the course of training and within a single training step:</p>
 
 <aside>Check out <a target="_self" href="#a1%3A_distributed_training_profiling" class="">A1: Distributed Training Profiling</a> for a walkthrough of how to profile your model.</aside>
 