Yeb Havinga committed
Commit · 01d1b85
1 Parent(s): 7215bcc
Update
Files changed: index.html (+3 -2)
index.html CHANGED

@@ -12,7 +12,7 @@
 </head>
 <body>
 <div md-src-pos="0..29528">
-<h1 md-src-pos="0..26">Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
+<h1 md-src-pos="0..26">Pre-training Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
 <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago my access to Google</span>'<span md-src-pos="65..85">s TPU Research Cloud</span> (<span md-src-pos="87..90">TRC</span>) <span md-src-pos="92..104">was renewed.</span> <span md-src-pos="105..133">My goal was to train several</span> <span md-src-pos="134..168">Dutch and Dutch+English T5 models,</span> <span md-src-pos="169..227">limited to model sizes that can run on a single GPU.</span> <span md-src-pos="302..417">The T5 model architecture is a text to text encoder/decoder model architecture that is suitable for a wide range of</span> <span md-src-pos="418..439">possibly mixed tasks,</span> <span md-src-pos="440..495">where each task is formulated with one or text prompts.</span></p>
 <ul md-src-pos="497..2062">
 <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>

@@ -609,7 +609,8 @@
 notes:</p>
 <ul md-src-pos="2812..4929">
 <li md-src-pos="2812..2869">Note: The <code md-src-pos="2824..2834">t5-small</code> model with 24 layers is not small.</li>
-<li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
+<li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
+See e.g. the section about finding the right hyperparameters for the base-36L training.</li>
 <li md-src-pos="3121..3339">The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of space, and the amount of memory required also depends on the length of source and target sequences.</li>
 <li md-src-pos="3340..3446">When iterating over models and running evaluation, a sqlite database can be used to scribble results on.</li>
 <li md-src-pos="3447..3602">PyCharm. Remote debugging from your workstation to either a TPU VM or your deep-learning workstation gives very good insight into the data structures.</li>
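One of the notes in the diffed section suggests scribbling evaluation results into a sqlite database while iterating over models. Below is a minimal sketch of that idea using only the Python standard library; the database file, table name, columns, and metric values are illustrative assumptions, not the author's actual setup.

# Minimal sketch: record per-model evaluation results in a local sqlite file.
# File name, table layout, and the example values are illustrative assumptions.
import sqlite3

con = sqlite3.connect("eval_results.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS results (
           model  TEXT,
           task   TEXT,
           metric TEXT,
           value  REAL,
           ts     TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

# After each evaluation run, append one row per metric.
con.execute(
    "INSERT INTO results (model, task, metric, value) VALUES (?, ?, ?, ?)",
    ("t5-base-dutch", "summarization", "rouge2", 0.1234),  # hypothetical numbers
)
con.commit()

# Later, compare models on a given metric.
for row in con.execute(
    "SELECT model, value FROM results WHERE metric = 'rouge2' ORDER BY value DESC"
):
    print(row)
con.close()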