Commit 3c81363 (parent: 9f2000a)
training text
static/tabs.html CHANGED (+37 -4)
@@ -60,7 +60,7 @@ a:visited {
 60 |   <div>
 61 |   <!-- Nav tabs -->
 62 |   <ul class="nav nav-tabs" role="tablist">
 63 | - <li role="presentation" class="active"><a href="#tab1" aria-controls="tab1" role="tab" data-toggle="tab">
 64 |   <li role="presentation"><a href="#tab2" aria-controls="tab2" role="tab" data-toggle="tab">Security</a></li>
 65 |   <li role="presentation"><a href="#tab3" aria-controls="tab3" role="tab" data-toggle="tab">Make Your Own</a></li>
 66 |   </ul>
@@ -68,9 +68,42 @@ a:visited {
 68 |   <!-- Tab panes -->
 69 |   <div class="tab-content">
 70 |   <div role="tabpanel" class="tab-pane active" id="tab1">
 71 | - <
 72 | -
 73 | -
 74 |   </div>
 75 |   <div role="tabpanel" class="tab-pane" id="tab2">
 76 |   <p>In this section, we discuss common concerns related to security of the collaborative training.</p>
 63 | + <li role="presentation" class="active"><a href="#tab1" aria-controls="tab1" role="tab" data-toggle="tab">Memory-Efficient Training</a></li>
 71 | + <p>
 72 | + Our aim is to train a large model in a decentralized fashion on consumer hardware or low-end cloud instances.
 73 | + This means we need to make the model, dataset, and other memory buffers fit onto a few GB of disk, 12-16 GB of CPU RAM,
 74 | + and 8-12 GB of GPU memory. Unfortunately, this rules out many popular techniques such as
 75 | + <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">ZeRO-Offload</a>:
 76 | + there is simply not enough RAM for that. Instead, we must make better use of what limited memory we have.
 77 | + To do this, we use two techniques: 8-bit optimizers for GPU memory and dataset streaming for RAM & HDD.
 78 | + </p>
 79 | + <p>
 80 | + <b>8-bit Optimizers:</b>
 81 | + Using optimizers such as LAMB or Adam requires four times as much GPU memory as simply storing the model parameters (8 bytes vs 2 bytes).
 82 | + As such, when training large models with many parameters, the optimizer states make up the largest chunk of memory.
 83 | + With 8-bit optimizers, this memory is reduced by 75% (to 2 bytes), making it much easier to fit large models onto consumer GPUs.
 84 | + </p><p>
 85 | + Naturally, we can combine this technique with offloading: storing 8-bit optimizer states in CPU memory rather
 86 | + than GPU memory (0 bytes GPU, 2 bytes CPU). To perform an optimizer update, we transfer the GPU gradients
 87 | + to the CPU, perform the optimizer update, and then transfer the updated weights back to the GPU.
 88 | + We can do this for each weight one-by-one, so that the additional CPU memory required for the optimizer update
 89 | + is minimal.
 90 | + The combination of offloading and 8-bit optimizers means that we conserve GPU memory (0 bytes per parameter)
 91 | + and also use only a limited amount of CPU memory (2 bytes per parameter).
 92 | +
 93 | + </p>
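As an aside, the 8-bit optimizer idea described above can be sketched with blockwise absmax quantization, the scheme used by libraries such as bitsandbytes. This is a minimal NumPy illustration, not the project's actual code; the block size and helper names are assumptions made for the example.

```python
import numpy as np

def quantize_blockwise(x, block_size=64):
    # Split the flat state tensor into blocks and store, per block,
    # one float32 scale (the block's absmax) plus one int8 code per value.
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    codes = np.round(blocks / scales * 127).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_blockwise(codes, scales):
    # Recover an approximate float32 state for the optimizer update.
    return (codes.astype(np.float32) / 127.0 * scales).reshape(-1)

# Illustrative optimizer state (e.g. Adam's first moment) for 4096 parameters.
state = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
```

Each state value now costs roughly 1 byte (plus one scale per block) instead of 4; to combine this with offloading as described above, `codes` and `scales` would simply live in CPU RAM rather than on the GPU.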
 94 | + <p>
 95 | + <b>Dataset Streaming:</b>
 96 | + Usually, data is stored on disk and needs to be fully or partially loaded into CPU memory to be used for training.
 97 | + Large datasets used for pre-training measure in <a href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a> or even <a href="https://laion.ai/laion-400-open-dataset/">terabytes</a>.
 98 | + This can pose a significant problem, as most desktops and cheap cloud instances simply do not have that much space.
 99 | + Furthermore, downloading the dataset over the internet would take hours before one could even begin training.
 100 | + <!--Changing the dataset means downloading a new dataset in full and using additional disk space.-->
 101 | + </p><p>
 102 | + To circumvent these problems, we stream the training dataset in the same way as you stream online videos.
 103 | + Participants download a small random portion of the training dataset and immediately begin training on it,
 104 | + while additional data is loaded in the background. As such, we can train a model with virtually no memory
 105 | + overhead from the dataset, and switching to a new dataset is as simple as changing an argument to the streamer class.
 106 | + </p>
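The streaming behaviour described above, training on one portion of the data while the next downloads in the background, can be sketched with a producer thread and a bounded queue. The `stream_shards`/`fetch` names and the toy shard contents are hypothetical, made up for this illustration:

```python
import queue
import threading

def stream_shards(shard_ids, fetch, prefetch=2):
    """Yield shards for training while a background thread fetches ahead.

    `fetch` downloads one shard; at most `prefetch` shards are buffered,
    so memory overhead stays bounded regardless of dataset size.
    """
    buf = queue.Queue(maxsize=prefetch)
    sentinel = object()

    def producer():
        for sid in shard_ids:
            buf.put(fetch(sid))  # blocks once `prefetch` shards are buffered
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (shard := buf.get()) is not sentinel:
        yield shard

# Toy stand-in for downloading one shard of examples over the network.
def fetch(sid):
    return [f"example-{sid}-{i}" for i in range(3)]

examples = [ex for shard in stream_shards(range(4), fetch) for ex in shard]
```

In practice this pattern is available off the shelf, e.g. `load_dataset(..., streaming=True)` in the Hugging Face `datasets` library, where switching to a new dataset is just a different argument.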
 107 |   </div>
 108 |   <div role="tabpanel" class="tab-pane" id="tab2">
 109 |   <p>In this section, we discuss common concerns related to security of the collaborative training.</p>