Yeb Havinga committed
Commit 01d1b85 · 1 Parent(s): 7215bcc
Files changed (1)
  1. index.html +3 -2
index.html CHANGED
@@ -12,7 +12,7 @@
  </head>
  <body>
  <div md-src-pos="0..29528">
- <h1 md-src-pos="0..26">Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
+ <h1 md-src-pos="0..26">Pre-training Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
  <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago my access to Google</span>'<span md-src-pos="65..85">s TPU Research Cloud</span> (<span md-src-pos="87..90">TRC</span>) <span md-src-pos="92..104">was renewed.</span> <span md-src-pos="105..133">My goal was to train several</span> <span md-src-pos="134..168">Dutch and Dutch+English T5 models,</span> <span md-src-pos="169..227">limited to model sizes that can run on a single GPU.</span> <span md-src-pos="302..417">The T5 model architecture is a text to text encoder/decoder model architecture that is suitable for a wide range of</span> <span md-src-pos="418..439">possibly mixed tasks,</span> <span md-src-pos="440..495">where each task is formulated with one or text prompts.</span></p>
  <ul md-src-pos="497..2062">
  <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>
@@ -609,7 +609,8 @@
  notes:</p>
  <ul md-src-pos="2812..4929">
  <li md-src-pos="2812..2869">Note: The <code md-src-pos="2824..2834">t5-small</code> model with 24 layers is not small.</li>
- <li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right. Below is a section about finding the right hyperparameters for the base-36L training.</li>
+ <li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
+ See e.g. the section about finding the right hyperparameters for the base-36L training.</li>
  <li md-src-pos="3121..3339">The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of space, and the amount of memory required also depends on the length of source and target sequences.</li>
  <li md-src-pos="3340..3446">When iterating over models and running evaluation, a sqlite database can be used to scribble results on.</li>
  <li md-src-pos="3447..3602">PyCharm. Remote debugging from your workstation to either a TPU VM or your deep-learning workstation gives very good insight into the data structures.</li>