Yeb Havinga committed
Commit · 01d1b85
1 Parent(s): 7215bcc
Update
Files changed: index.html (+3 -2)
index.html CHANGED

@@ -12,7 +12,7 @@
 </head>
 <body>
 <div md-src-pos="0..29528">
-<h1 md-src-pos="0..26">Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
+<h1 md-src-pos="0..26">Pre-training Dutch <!-- doesnt work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
 <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago my access to Google</span>'<span md-src-pos="65..85">s TPU Research Cloud</span> (<span md-src-pos="87..90">TRC</span>) <span md-src-pos="92..104">was renewed.</span> <span md-src-pos="105..133">My goal was to train several</span> <span md-src-pos="134..168">Dutch and Dutch+English T5 models,</span> <span md-src-pos="169..227">limited to model sizes that can run on a single GPU.</span> <span md-src-pos="302..417">The T5 model architecture is a text to text encoder/decoder model architecture that is suitable for a wide range of</span> <span md-src-pos="418..439">possibly mixed tasks,</span> <span md-src-pos="440..495">where each task is formulated with one or text prompts.</span></p>
 <ul md-src-pos="497..2062">
 <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>

@@ -609,7 +609,8 @@
 notes:</p>
 <ul md-src-pos="2812..4929">
 <li md-src-pos="2812..2869">Note: The <code md-src-pos="2824..2834">t5-small</code> model with 24 layers is not small.</li>
-<li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
+<li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
+See e.g. the section about finding the right hyperparameters for the base-36L training.</li>
 <li md-src-pos="3121..3339">The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of space, and the amount of memory required also depends on the length of source and target sequences.</li>
 <li md-src-pos="3340..3446">When iterating over models and running evaluation, a sqlite database can be used to scribble results on.</li>
 <li md-src-pos="3447..3602">PyCharm. Remote debugging from your workstation to either a TPU VM or your deep-learning workstation gives very good insight into the data structures.</li>
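One of the notes in the diffed section suggests scribbling evaluation results into a sqlite database while iterating over models. Below is a minimal sketch of that idea using only the Python standard library; the database file, table name, columns, and metric values are illustrative assumptions, not the author's actual setup.

# Minimal sketch: record per-model evaluation results in a local sqlite file.
# File name, table layout, and the example values are illustrative assumptions.
import sqlite3

con = sqlite3.connect("eval_results.db")
con.execute(
    """CREATE TABLE IF NOT EXISTS results (
           model  TEXT,
           task   TEXT,
           metric TEXT,
           value  REAL,
           ts     TEXT DEFAULT CURRENT_TIMESTAMP
       )"""
)

# After each evaluation run, append one row per metric.
con.execute(
    "INSERT INTO results (model, task, metric, value) VALUES (?, ?, ?, ?)",
    ("t5-base-dutch", "summarization", "rouge2", 0.1234),  # hypothetical numbers
)
con.commit()

# Later, compare models on a given metric.
for row in con.execute(
    "SELECT model, value FROM results WHERE metric = 'rouge2' ORDER BY value DESC"
):
    print(row)
con.close()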