Yeb Havinga committed on
Commit
d71958a
·
1 Parent(s): 765d820
Files changed (1)
  1. index.html +152 -132
index.html CHANGED
@@ -13,7 +13,10 @@
13
  <body>
14
  <div md-src-pos="0..29528">
15
  <h1 md-src-pos="0..26">Pre-training Dutch <!-- doesn't work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
16
- <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago my access to Google</span>'<span md-src-pos="65..85">s TPU Research Cloud</span> (<span md-src-pos="87..90">TRC</span>) <span md-src-pos="92..104">was renewed.</span> <span md-src-pos="105..133">My goal was to train several</span> <span md-src-pos="134..168">Dutch and Dutch+English T5 models,</span> <span md-src-pos="169..227">limited to model sizes that can run on a single GPU.</span> <span md-src-pos="302..417">The T5 model architecture is a text to text encoder/decoder model architecture that is suitable for a wide range of</span> <span md-src-pos="418..439">possibly mixed tasks,</span> <span md-src-pos="440..495">where each task is formulated with one or more text prompts.</span></p>
17
  <ul md-src-pos="497..2062">
18
  <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>
19
  <li md-src-pos="752..1482"><strong md-src-pos="754..859"><a target="_blank" href="https://arxiv.org/abs/2110.08207" md-src-pos="756..857">Multitask Prompted Training Enables Zero-Shot Task Generalization</a></strong> by <em md-src-pos="863..1481">Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush</em>.</li>
@@ -26,11 +29,156 @@
26
  <li md-src-pos="2203..2305"><a target="_blank" href="https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104" md-src-pos="2205..2305">https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104</a></li>
27
  <li md-src-pos="2306..2407"><a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks" md-src-pos="2308..2407">https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks</a></li>
28
  </ul>
29
- <p md-src-pos="2409..2753"><span md-src-pos="2409..2463">While T5 is mainly researched for multi-task training,</span> <span md-src-pos="2464..2465">I</span>'<span md-src-pos="2466..2520">ve only pre-trained using the masked language modeling</span> <span md-src-pos="2521..2531">objective,</span> <span md-src-pos="2532..2572">and fine-tuned with at most two prompts.</span> <span md-src-pos="2573..2641">My main goal was to provide pre-trained Dutch only and Dutch+English</span> <span md-src-pos="2642..2649">models,</span> <span md-src-pos="2650..2682">for later experimenting with any</span> (<span md-src-pos="2684..2690">set of</span>) <span md-src-pos="2692..2731">tasks that can be formulated in a Dutch</span> (<span md-src-pos="2733..2743">or English</span>) <span md-src-pos="2745..2752">prompt.</span></p>
30
- <h2 md-src-pos="4931..4979">Pre-trained Dutch and Dutch+English T5 models</h2>
31
  <p md-src-pos="4981..5522"><span md-src-pos="4981..5024">Three types of T5 models have been trained.</span> <code md-src-pos="5025..5040">t5-base-dutch</code> <span md-src-pos="5041..5086">is the only model with an original T5 config.</span> <span md-src-pos="5087..5132">The other model types t5-v1.1 and t5-eff have</span> <code md-src-pos="5133..5145">gated-relu</code> <span md-src-pos="5146..5156">instead of</span> <code md-src-pos="5157..5163">relu</code> <span md-src-pos="5164..5187">as the activation function,</span> <span md-src-pos="5188..5218">and were trained with a dropout of</span> <code md-src-pos="5219..5224">0.0</code> <span md-src-pos="5225..5254">unless training would diverge</span> (<code md-src-pos="5256..5283">t5-v1.1-large-dutch-cased</code>)<span md-src-pos="5284..5285">.</span> <span md-src-pos="5286..5353">The t5-eff models differ in their number of layers.</span> <span md-src-pos="5354..5373">The table below lists</span> <span md-src-pos="5374..5413">the dimensions of these models.</span> <span md-src-pos="5414..5450">Not all t5-eff models are efficient;</span> <span md-src-pos="5451..5489">the best example is the inefficient</span> <code md-src-pos="5490..5520">t5-xl-4L-dutch-english-cased</code><span md-src-pos="5520..5521">.</span></p>
32
  <table md-src-pos="5524..14583">
33
  <thead>
 
34
  <tr md-src-pos="5524..6614">
35
  <th md-src-pos="5525..5544"></th>
36
  <th md-src-pos="5545..5611"><a target="_blank" href="https://huggingface.co/yhavinga/t5-base-dutch" md-src-pos="5546..5608">t5-base-dutch</a></th>
@@ -498,136 +646,8 @@
498
  </tr>
499
  </tbody>
500
  </table>
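+ <p>The activation and dropout settings described above correspond, in the Huggingface T5 configuration, to roughly the following (a sketch: only the options mentioned in the text are set, everything else is left at its default):</p>
+ <pre class="code-fence"><code>from transformers import T5Config
+
+ # t5-v1.1 / t5-eff style settings: a gated activation in the feed-forward
+ # block and no dropout during pre-training.
+ config = T5Config(feed_forward_proj="gated-relu", dropout_rate=0.0)
+ </code></pre>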
501
- <h2 md-src-pos="18893..18908">Pre-training</h2>
502
- <h3 md-src-pos="18910..18925">mC4 dataset</h3>
503
- <p md-src-pos="18927..20023"><span md-src-pos="18927..18949">A few weeks before the</span> '<span md-src-pos="18951..18972">21 Hackathon started,</span> <span
504
- md-src-pos="18973..19017">the multilingual C4 TensorFlow dataset (created by the
505
- original T5 authors) was prepared by AllenNLP and</span> <a target="_blank" href="https://huggingface.co/datasets/allenai/c4" md-src-pos="19018..19086">released on the HF hub</a><span md-src-pos="19086..19087">.</span> <span md-src-pos="19088..19134">In the hackathon we cleaned Dutch mC4 with the</span> <a target="_blank" href="https://gitlab.com/yhavinga/c4nlpreproc" md-src-pos="19135..19190">code adapted</a> <span md-src-pos="19191..19202">from the C4</span> <span
506
- md-src-pos="19203..19222">TensorFlow dataset,</span> <span md-src-pos="19223..19266">and used the resulting text files directly.</span> <span md-src-pos="19267..19315">We also verified that Dutch C4 was deduplicated.</span> <span md-src-pos="19316..19413">To be able to easily reuse this dataset for more pre-training sessions with Huggingfaces scripts,</span> <span md-src-pos="19414..19427">a Huggingface</span> <span md-src-pos="19428..19447">dataset was created</span>: <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned" md-src-pos="19449..19522">mc4_nl_cleaned</a><span md-src-pos="19522..19523">.</span> <span md-src-pos="19524..19555">For Dutch and English training,</span> <span md-src-pos="19556..19623">a couple of additional configs were added to the generation script,</span> <span md-src-pos="19624..19671">to produce interleaved Dutch and English texts.</span> <span md-src-pos="19672..19685">For instance,</span> <span md-src-pos="19686..19689">the</span> <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train" md-src-pos="19690..19792">micro_en_nl config</a> <span md-src-pos="19793..19826">mixes Dutch with English samples.</span> <span md-src-pos="19827..19884">Since the cleaned English C4 data is about 5 times larger</span> (<span md-src-pos="19886..19905">in compressed bytes</span>) <span md-src-pos="19907..19927">than the Dutch part,</span> <span md-src-pos="19928..19947">the mixed Dutch and</span> <span md-src-pos="19948..19966">English configs in</span> <code md-src-pos="19967..19983">mc4_nl_cleaned</code> <span md-src-pos="19984..20023">do not contain the complete English C4.</span></p>
507
- <p md-src-pos="20025..20161"><span md-src-pos="20025..20069">The full cleaned Dutch mC4 dataset is 151GB,</span> <span md-src-pos="20070..20082">and still is</span> (<span md-src-pos="20084..20088">June</span> '<span md-src-pos="20090..20092">22</span>) <span md-src-pos="20094..20136">the largest Dutch cleaned corpus currently</span> <span md-src-pos="20137..20161">available on the HF Hub.</span></p>
508
- <h3 md-src-pos="20163..20243">One epoch over the complete dataset, or multiple epochs on a smaller config?</h3>
509
- <p md-src-pos="20245..21303"><span md-src-pos="20245..20326">Something I noticed with the Flax mlm pretraining script I was using at the time,</span> <span md-src-pos="20327..20360">that the per-batch training speed</span> <span md-src-pos="20361..20440">seemed slower at the beginning of epochs when a larger dataset config was used.</span> <span md-src-pos="20441..20446">Also,</span> <span md-src-pos="20447..20464">on large configs,</span> <span md-src-pos="20465..20480">batch shuffling</span> <span md-src-pos="20481..20523">would fail with a TPU out-of-memory error.</span> <span md-src-pos="20524..20615">For these reasons I started experimenting with training for more epochs on smaller configs.</span> <span md-src-pos="20616..20634">This should be ok.</span> <span md-src-pos="20635..20717">In the original T5 paper downstream performance was compared between training on 2</span><sup><span md-src-pos="20722..20724">35</span></sup> <span md-src-pos="20731..20749">tokens vs training</span> <span md-src-pos="20750..20784">multiple epochs on a smaller part.</span> <span md-src-pos="20785..20800">64 repeats of 2</span><sup><span md-src-pos="20805..20807">29</span></sup> <span md-src-pos="20814..20871">tokens did not result in degraded downstream performance.</span> <span md-src-pos="20872..20881">The model</span> <code md-src-pos="20882..20925">yhavinga/t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="20926..20943">is trained on the</span> <code md-src-pos="20944..20951">small</code> <span md-src-pos="20952..20973">config for 10 epochs.</span> <span md-src-pos="20974..20985">In the end,</span> <span md-src-pos="20986..21001">a change to the</span> <a target="_blank" href="https://github.com/huggingface/transformers/blame/main/examples/flax/language-modeling/run_t5_mlm_flax.py" md-src-pos="21002..21130">pre-training script</a> <span md-src-pos="21131..21157">to perform batch shuffling</span> (<span md-src-pos="21159..21177">permuting an array</span>) <span md-src-pos="21179..21189">on the CPU</span> <span md-src-pos="21190..21250">instead of the accelerator device solved all related issues,</span> <span md-src-pos="21251..21303">and larger configs could be used without any issues.</span></p>
510
- <h3 md-src-pos="21305..21338">Which optimizer and lr to use</h3>
511
- <p md-src-pos="21340..22064"><span md-src-pos="21340..21346">In the</span> '<span md-src-pos="21348..21428">21 Flax hackathon we quickly decided on using Adafactor with learning rate 5e-3.</span> <span md-src-pos="21429..21460">I was sure that with more time,</span> <span md-src-pos="21461..21493">a better setting could be found.</span> <span md-src-pos="21494..21535">After performing 7 sweeps with Adafactor,</span> <span md-src-pos="21536..21565">AdamW and Distributed Shampoo</span> (<span md-src-pos="21567..21579">experimental</span> <span md-src-pos="21580..21609">PJIT version from Dall-E mini</span>)<span md-src-pos="21610..21611">,</span> <span md-src-pos="21612..21646">I gave up to find better settings.</span> <span md-src-pos="21647..21705">The graph below shows the runs from all 7 sweeps combined.</span> <span md-src-pos="21706..21731">Apologies for the legend,</span> <span md-src-pos="21732..21774">I cannot show the optimizer in the legend,</span> <span md-src-pos="21775..21843">because the initial version of the training script had the optimizer</span> <code md-src-pos="21844..21857">--adafactor</code> <span md-src-pos="21858..21860">as</span> <span md-src-pos="21861..21869">boolean,</span> <span md-src-pos="21870..21928">which I later changed to a string with the optimizer name.</span> <span md-src-pos="21929..21986">All runs in the graph below that get the loss below 4 use</span> <strong md-src-pos="21987..22000">Adafactor</strong><span md-src-pos="22000..22001">.</span> <span md-src-pos="22002..22054">Peach-sweep-6 is dashed orange and has learning rate</span> <strong md-src-pos="22055..22063">5e-3</strong><span md-src-pos="22063..22064">.</span></p>
512
- <p md-src-pos="22066..22129"><img src="adafactor_vs_adam_pretrain.png" alt="Adafactor vs Adam vs Shampoo" __idea-generated="true" md-src-pos="22066..22129" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/adafactor_vs_adam_pretrain.png"></p>
513
- <p md-src-pos="22131..22458"><span md-src-pos="22131..22235">While there probably is a setting that will allow Adam and Shampoo to also converge fast below loss 4.0,</span> <span md-src-pos="22236..22248">I was unable</span> <span md-src-pos="22249..22260">to find it.</span> <span md-src-pos="22261..22322">In a recent tweet Lucas Nestler had more success with Shampoo</span> (<a target="_blank" href="https://twitter.com/_clashluke/status/1535994026876252160" md-src-pos="22324..22381">https://twitter.com/_clashluke/status/1535994026876252160</a>) <span md-src-pos="22383..22458">so maybe I need to revisit the attempt with the latest upstream code bases.</span></p>
514
- <h3 md-src-pos="22460..22508">Bfloat16 datatype and learning rate schedule</h3>
515
- <p md-src-pos="22510..23055"><span md-src-pos="22510..22588">I had some additional options in the pre-training script that I wanted to use.</span> <span md-src-pos="22589..22623">An exponential decay learning rate</span> <span md-src-pos="22624..22684">schedule would allow me to pre-train for as long as desired,</span> <span md-src-pos="22685..22720">instead of a fixed number of steps.</span> <span md-src-pos="22721..22764">I was also keen to pre-train with bfloat16,</span> <span md-src-pos="22765..22808">for the reduced memory footprint and speed.</span> <span md-src-pos="22809..22821">This failed.</span> <span md-src-pos="22822..22901">The graph below shows different attempts with the legend showing the optimizer,</span> <span md-src-pos="22902..22908">dtype,</span> <span md-src-pos="22909..22923">learning rate,</span> <span md-src-pos="22924..22965">total batch size and lr-schedule to train</span> <a target="_blank" href="https://huggingface.co/yhavinga/t5-small-24L-dutch-english" md-src-pos="22966..23054">t5-small-24L-dutch-english</a><span md-src-pos="23054..23055">.</span></p>
516
- <p md-src-pos="23057..23098"><img src="bfloat16_loss.png" alt="Bfloat16 vs Float32" __idea-generated="true" md-src-pos="23057..23098" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/bfloat16_loss.png"></p>
517
- <p md-src-pos="23100..23378"><span md-src-pos="23100..23111">In the end,</span> <span md-src-pos="23112..23167">all models released on the hub are trained with Flax in</span> <code md-src-pos="23168..23177">float32</code><span md-src-pos="23177..23178">.</span> <span md-src-pos="23179..23193">For reference,</span> <span md-src-pos="23194..23195">I</span>'<span md-src-pos="23196..23214">ve ran Stas Bekman</span>'<span md-src-pos="23215..23227">s script for</span> <a target="_blank" href="https://github.com/stas00/ml-ways/blob/master/numbers/detect-model-pretrained-in-bf16-fp16-fp32.ipynb" md-src-pos="23228..23376">bf16, fp16 or fp32 model pretrain detection</a><span md-src-pos="23376..23377">.</span></p>
518
- <pre class="code-fence" md-src-pos="23380..24514"><code md-src-pos="23380..24514">
519
- <div class="code-fence-highlighter-copy-button" data-fence-content="ICAgICAgICAgICAgICAgICAgICAgICBuYW1lICAgICAgICAgICAgICAgICAgICAgICAgfCAgYWJzIG1pbiAgfCAgYWJzIG1heCAgCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLXwtLS0tLS0tLS0tLXwtLS0tLS0tLS0tLQp5aGF2aW5nYS90NS1iYXNlLWR1dGNoICAgICAgICAgICAgICAgICAgICAgICAgICAgICB8IDEuNzU3ZS0wOSB8IDYuNzkyZSswMQp5aGF2aW5nYS90NS12MS4xLWJhc2UtZHV0Y2gtdW5jYXNlZCAgICAgICAgICAgICAgICB8IDEuMjE4ZS0wOSB8IDYuNzA4ZSswMgp5aGF2aW5nYS90NS12MS4xLWJhc2UtZHV0Y2gtY2FzZWQgICAgICAgICAgICAgICAgICB8IDMuMDA5ZS0wOSB8IDguODIxZSswMgp5aGF2aW5nYS90NS12MS4xLWxhcmdlLWR1dGNoLWNhc2VkICAgICAgICAgICAgICAgICB8IDAuMDAwZSswMCB8IDUuMDUzZSswMwp5aGF2aW5nYS90NS12MV8xLWJhc2UtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICAgICB8IDUuMTQwZS0wOSB8IDMuMTExZSswMwp5aGF2aW5nYS90NS12MV8xLWJhc2UtZHV0Y2gtZW5nbGlzaC1jYXNlZC0xMDI0ICAgICB8IDkuMzU5ZS0xMCB8IDEuMzA4ZSswMgp5aGF2aW5nYS90NS1zbWFsbC0yNEwtZHV0Y2gtZW5nbGlzaCAgICAgICAgICAgICAgICB8IDEuNTc3ZS0wOSB8IDEuMjc2ZSswMgp5aGF2aW5nYS90NS14bC00TC1kdXRjaC1lbmdsaXNoLWNhc2VkICAgICAgICAgICAgICB8IDMuMjM0ZS0xMSB8IDMuOTg2ZSswMQp5aGF2aW5nYS90NS1iYXNlLTM2TC1kdXRjaC1lbmdsaXNoLWNhc2VkICAgICAgICAgICB8IDIuNDA5ZS0xMCB8IDYuMTA0ZSswMQp5aGF2aW5nYS90NS1lZmYteGwtOGwtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICAgICB8IDUuNTMwZS0xMCB8IDguOTEyZSswMgp5aGF2aW5nYS90NS1lZmYtbGFyZ2UtOGwtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICB8IDEuMDg2ZS0xMCB8IDUuMTI4ZSswMgp5aGF2aW5nYS90NS1iYXNlLTM2TC1jY21hdHJpeC1tdWx0aSAgICAgICAgICAgICAgICB8IDEuNzE1ZS0xMSB8IDMuNzQ2ZSswMQp5aGF2aW5nYS90NS1zbWFsbC0yNEwtY2NtYXRyaXgtbXVsdGkgICAgICAgICAgICAgICB8IDcuMDg2ZS0xMCB8IDEuMDUzZSswMgo=">
520
-
521
- <img class="code-fence-highlighter-copy-button-icon">
522
 
523
- </div><span md-src-pos="23380..23384"></span><span md-src-pos="23384..23459"> name | abs min | abs max </span>
524
- <span md-src-pos="23460..23535">---------------------------------------------------|-----------|-----------</span>
525
- <span md-src-pos="23536..23610">yhavinga/t5-base-dutch | 1.757e-09 | 6.792e+01</span>
526
- <span md-src-pos="23611..23685">yhavinga/t5-v1.1-base-dutch-uncased | 1.218e-09 | 6.708e+02</span>
527
- <span md-src-pos="23686..23760">yhavinga/t5-v1.1-base-dutch-cased | 3.009e-09 | 8.821e+02</span>
528
- <span md-src-pos="23761..23835">yhavinga/t5-v1.1-large-dutch-cased | 0.000e+00 | 5.053e+03</span>
529
- <span md-src-pos="23836..23910">yhavinga/t5-v1_1-base-dutch-english-cased | 5.140e-09 | 3.111e+03</span>
530
- <span md-src-pos="23911..23985">yhavinga/t5-v1_1-base-dutch-english-cased-1024 | 9.359e-10 | 1.308e+02</span>
531
- <span md-src-pos="23986..24060">yhavinga/t5-small-24L-dutch-english | 1.577e-09 | 1.276e+02</span>
532
- <span md-src-pos="24061..24135">yhavinga/t5-xl-4L-dutch-english-cased | 3.234e-11 | 3.986e+01</span>
533
- <span md-src-pos="24136..24210">yhavinga/t5-base-36L-dutch-english-cased | 2.409e-10 | 6.104e+01</span>
534
- <span md-src-pos="24211..24285">yhavinga/t5-eff-xl-8l-dutch-english-cased | 5.530e-10 | 8.912e+02</span>
535
- <span md-src-pos="24286..24360">yhavinga/t5-eff-large-8l-dutch-english-cased | 1.086e-10 | 5.128e+02</span>
536
- <span md-src-pos="24361..24435">yhavinga/t5-base-36L-ccmatrix-multi | 1.715e-11 | 3.746e+01</span>
537
- <span md-src-pos="24436..24510">yhavinga/t5-small-24L-ccmatrix-multi | 7.086e-10 | 1.053e+02</span>
538
- <span md-src-pos="24511..24511"></span><span md-src-pos="24511..24514"></span></code></pre>
539
- <h3 md-src-pos="24516..24554">Training t5-base-36L-dutch-english</h3>
540
- <p md-src-pos="24556..24958"><span md-src-pos="24556..24668">The following image shows the loss curves of the sessions in which I was trying to find the right combination of</span> <span md-src-pos="24669..24685">total batch size</span> (<span md-src-pos="24687..24721">by adjusting gradient accumulation</span>)<span md-src-pos="24722..24723">,</span> <span md-src-pos="24724..24751">learning rate and datatype.</span> <span md-src-pos="24752..24766">Unfortunately,</span> <span md-src-pos="24767..24784">again I could not</span> <span md-src-pos="24785..24818">find a good setting for bfloat16.</span> <span md-src-pos="24819..24867">The three green runs are the ones that end up in</span> <code md-src-pos="24868..24895">t5-base-36L-dutch-english</code><span md-src-pos="24895..24896">.</span> <span md-src-pos="24897..24930">Numbers shown are learning reate,</span> <span md-src-pos="24931..24958">dtype and total batch size.</span></p>
541
- <p md-src-pos="24960..25020"><img src="training_base_36l_losses.png" alt="t5 base 36L training losses" __idea-generated="true" md-src-pos="24960..25020" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/training_base_36l_losses.png"></p>
542
- <h2 md-src-pos="25022..25035">Evaluation</h2>
543
- <h3 md-src-pos="25037..25086">Optimizer and learning rate for summarization</h3>
544
- <p md-src-pos="25088..25365"><span md-src-pos="25088..25195">Finetuning summarization requires more memory than translation due to the longer sequence lengths involved.</span> <span md-src-pos="25196..25255">I wondered if I could use Adafactor instead of Adam and ran</span> <span md-src-pos="25256..25277">a sweep to test this.</span> <span md-src-pos="25278..25318">The sweep was configured with Hyperband,</span> <span md-src-pos="25319..25365">so not all training runs completed to the end.</span></p>
545
- <p md-src-pos="25367..25439"><img src="optim_lr_summarization.png" alt="Optimizer Learning rate for summarization" __idea-generated="true" md-src-pos="25367..25439" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/optim_lr_summarization.png"></p>
546
- <p md-src-pos="25441..25479"><span md-src-pos="25441..25478">The training losses are graphed below</span>:</p>
547
- <p md-src-pos="25481..25564"><img src="training_losses_summarization_sweep.png" alt="Training losses for summarization sweep" __idea-generated="true" md-src-pos="25481..25564" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/training_losses_summarization_sweep.png"></p>
548
- <p md-src-pos="25566..25921"><span md-src-pos="25566..25642">While the Adafactor run with learning rate 7e-4 came close to the Adam runs,</span> <span md-src-pos="25643..25689">the consistent stability of training with Adam</span> <span md-src-pos="25690..25769">made me stick with Adam as optimizer for evaluation runs on the several models.</span> <span md-src-pos="25770..25811">For translation the results were similar,</span> <span md-src-pos="25812..25881">though in the end I needed to configure a lower learning rate for all</span> <span md-src-pos="25882..25920">models to converge during fine-tuning.</span></p>
549
- <h3 md-src-pos="25923..25950">Running evaluation runs</h3>
550
- <p md-src-pos="25952..26355"><span md-src-pos="25952..26033">The original T5 paper ran all evaluations with a constant learning rate of 0.001.</span> <span md-src-pos="26034..26068">According to the sweep 0.001 would</span> <span md-src-pos="26069..26123">work nicely with the Adam optimizer for summarization.</span> <span md-src-pos="26124..26185">A single model evaluation consisted of fine-tuning the model,</span> <span md-src-pos="26186..26260">followed by running predictions and metrics calculation on the test split.</span> <span md-src-pos="26261..26301">Fine-tuning for evaluation was done on a</span> <span md-src-pos="26302..26355">limited set of example from the fine-tuning datasets.</span></p>
551
- <table md-src-pos="26357..26869">
552
- <thead>
553
- <tr md-src-pos="26357..26413">
554
- <th align="right" md-src-pos="26358..26373"></th>
555
- <th md-src-pos="26374..26392">Summarization</th>
556
- <th md-src-pos="26393..26412">Translation</th>
557
- </tr>
558
- </thead>
559
- <tbody>
560
- <tr md-src-pos="26471..26527">
561
- <td align="right" md-src-pos="26472..26487">Dataset</td>
562
- <td md-src-pos="26488..26506">CNN Dailymail NL</td>
563
- <td md-src-pos="26507..26526">CCMatrix en -&gt; nl</td>
564
- </tr>
565
- <tr class="intellij-row-even" md-src-pos="26528..26584">
566
- <td align="right" md-src-pos="26529..26544">#Samples</td>
567
- <td md-src-pos="26545..26563">50K</td>
568
- <td md-src-pos="26564..26583">50K</td>
569
- </tr>
570
- <tr md-src-pos="26585..26641">
571
- <td align="right" md-src-pos="26586..26601">Optimizer</td>
572
- <td md-src-pos="26602..26620">Adam</td>
573
- <td md-src-pos="26621..26640">Adam</td>
574
- </tr>
575
- <tr class="intellij-row-even" md-src-pos="26642..26698">
576
- <td align="right" md-src-pos="26643..26658">learning rate</td>
577
- <td md-src-pos="26659..26677">0.001</td>
578
- <td md-src-pos="26678..26697">0.0005</td>
579
- </tr>
580
- <tr md-src-pos="26699..26755">
581
- <td align="right" md-src-pos="26700..26715">source length</td>
582
- <td md-src-pos="26716..26734">1024</td>
583
- <td md-src-pos="26735..26754">128</td>
584
- </tr>
585
- <tr class="intellij-row-even" md-src-pos="26756..26812">
586
- <td align="right" md-src-pos="26757..26772">target length</td>
587
- <td md-src-pos="26773..26791">142</td>
588
- <td md-src-pos="26792..26811">128</td>
589
- </tr>
590
- <tr md-src-pos="26813..26869">
591
- <td align="right" md-src-pos="26814..26829">#eval samples</td>
592
- <td md-src-pos="26830..26848">1000</td>
593
- <td md-src-pos="26849..26868">1000</td>
594
- </tr>
595
- </tbody>
596
- </table>
597
- <p md-src-pos="26872..26943"><span md-src-pos="26872..26942">The graph below shows the train loss curves for the summarization runs</span>:</p>
598
- <p md-src-pos="26945..27021"><img src="train_loss_eval_summarization.png" alt="Train loss evaluation T5 summarization" __idea-generated="true" md-src-pos="26945..27021" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/train_loss_eval_summarization.png"></p>
599
- <p md-src-pos="27023..27092"><span md-src-pos="27023..27091">The graph below shows the train loss curves for the translation runs</span>:</p>
600
- <p md-src-pos="27094..27169"><img src="train_loss_eval_t5_translation.png" alt="Train loss evaluation T5 translation" __idea-generated="true" md-src-pos="27094..27169" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/train_loss_eval_t5_translation.png"></p>
601
- <p md-src-pos="27171..27494"><span md-src-pos="27171..27216">The figure below shows the evaluation scores,</span> <span md-src-pos="27217..27266">where the x-axis shows the translation Bleu score</span> (<span md-src-pos="27268..27284">higher is better</span>) <span md-src-pos="27286..27339">and y-axis the summarization Rouge1 translation score</span> (<span md-src-pos="27341..27357">higher is better</span>)<span md-src-pos="27358..27359">.</span> <span md-src-pos="27360..27405">Point size is proportional to the model size.</span> <span md-src-pos="27406..27451">Models with faster inference speed are green,</span> <span md-src-pos="27452..27477">slower inference speed is</span> <span md-src-pos="27478..27494">plotted as blue.</span></p>
602
- <p md-src-pos="27496..27559"><img src="evaluation_t5_dutch_english.png" alt="Evaluation T5 Dutch English" __idea-generated="true" md-src-pos="27496..27559" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/evaluation_t5_dutch_english.png"></p>
603
- <p md-src-pos="27561..28104"><span md-src-pos="27561..27593">While it is clear that the model</span> <code md-src-pos="27594..27627">t5-base-36L-dutch-english-cased</code> (<span md-src-pos="27629..27649">with 729M parameters</span>) <span md-src-pos="27651..27671">has the best scores,</span> <span md-src-pos="27672..27679">it also</span> <span md-src-pos="27680..27705">among the slowest models.</span> <span md-src-pos="27706..27715">The model</span> <code md-src-pos="27716..27753">t5-eff-large-8l-dutch-english-cased</code> (<span md-src-pos="27755..27775">with 335M parameters</span>) <span md-src-pos="27777..27796">has the second best</span> <span md-src-pos="27797..27841">training loss after 390 steps in both tasks,</span> <span md-src-pos="27842..27878">but with a 4 times faster inference.</span> <span md-src-pos="27879..27915">Surprizing is the difference between</span> <code md-src-pos="27916..27950">t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="27951..27954">and</span> <code md-src-pos="27955..27994">t5-v1_1-base-dutch-english-cased-1024</code><span md-src-pos="27994..27995">,</span> <span md-src-pos="27996..28035">most notable on the summarization task.</span> <span md-src-pos="28036..28103">This might be due to the difference in pre-training sequence length</span>:</p>
604
- <h3 md-src-pos="28106..28137">Sequence length 512 or 1024</h3>
605
- <p md-src-pos="28139..28593"><span md-src-pos="28139..28149">The models</span> <code md-src-pos="28150..28184">t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="28185..28188">and</span> <code md-src-pos="28189..28228">t5-v1_1-base-dutch-english-cased-1024</code> <span md-src-pos="28229..28260">have the same model dimensions,</span> <span md-src-pos="28261..28311">but are pre-trained on different sequence lenghts,</span> <span md-src-pos="28312..28338">512 and 1024 respectively.</span> <span md-src-pos="28339..28412">The evaluation loss and accuracy of the models do not look too different.</span> <span md-src-pos="28413..28465">Since training of the 1024 sequence length model was</span> <span md-src-pos="28466..28484">very slow and didn</span>'<span md-src-pos="28485..28516">t converge a was was very slow,</span> <span md-src-pos="28517..28536">I stopped it early.</span> <span md-src-pos="28537..28574">The figure below shows the evaluation</span> <span md-src-pos="28575..28593">loss and accuracy.</span></p>
606
- <p md-src-pos="28595..28685"><img src="t5v1_1eval_loss_and_accuracy.png" alt="T5 v11 base dutch english eval loss and accuracypng" __idea-generated="true" md-src-pos="28595..28685" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/t5v1_1eval_loss_and_accuracy.png"></p>
607
- <p md-src-pos="28687..29049"><span md-src-pos="28687..28749">The 512 sequence length model was trained for 10 epochs of the</span> <code md-src-pos="28750..28757">small</code> <span md-src-pos="28758..28770">nl+en config</span> (<span md-src-pos="28772..28789">186B tokens total</span>) <span md-src-pos="28791..28803">and the 1024</span> <span md-src-pos="28804..28847">sequence length model about 2 epochs of the</span> <code md-src-pos="28848..28855">large</code> <span md-src-pos="28856..28868">nl+en config</span> (<span md-src-pos="28870..28887">100B tokens total</span>)<span md-src-pos="28888..28889">.</span> <span md-src-pos="28890..28921">While I expected both models to</span> <span md-src-pos="28922..28960">perform similarly on downstream tasks,</span> <span md-src-pos="28961..29018">the 1024 sequence length model has better scores for both</span> <span md-src-pos="29019..29049">summarization and translation.</span></p>
608
- <p md-src-pos="2755..2810"><span md-src-pos="2755..2798">Some final notes:</span></p>
610
- <ul md-src-pos="2812..4929">
611
- <li md-src-pos="2812..2869">Note: The <code md-src-pos="2824..2834">t5-small</code> model with 24 layers is not small.</li>
612
- <li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
613
- See e.g. the section about finding the right hyperparameters for the base-36L training.</li>
614
- <li md-src-pos="3121..3339">The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of space, and the amount of memory required also depends on the length of source and target sequences.</li>
615
- <li md-src-pos="3340..3446">When iterating over models and running evaluation, a sqlite database can be used to scribble results on.</li>
616
- <li md-src-pos="3447..3602">PyCharm. Remote debugging from your workstation to either a TPU VM or your deep-learning workstation gives very good insight into the data structures.</li>
617
- <li md-src-pos="3603..3731">When increasing the batch size, increase the learning rate. bs * 2 -&gt; lr * sqrt(2) is a good heuristic but mileage may vary.</li>
618
- <li md-src-pos="3732..3934">Dropout or not. It is a regularization technique, but also takes up memory. First try without dropout. If that doesn't work, try it with dropout. The smaller models can probably be trained without.</li>
619
- <li md-src-pos="3935..4040">Training in <code md-src-pos="3949..3959">bfloat16</code> is hard to get right. If suspicious of a result, switch back to <code md-src-pos="4024..4033">float32</code> first.</li>
620
- <li md-src-pos="4041..4218">Translation evaluation: the low score of the 128 seq len models on opus books may be because of the brevity penalty, since books may have sentences longer than 128 tokens.</li>
621
- <li md-src-pos="4219..4354"><code md-src-pos="4221..4258">t5-eff-large-8l-dutch-english-cased</code> has good aptitude for the translation task and is fast: a good candidate for serious fine-tuning.</li>
622
- <li md-src-pos="4355..4442"><code md-src-pos="4357..4387">t5-xl-4l-dutch-english-cased</code> is both slow and exhibits bad fine-tuning performance.</li>
623
- <li md-src-pos="4443..4502">I need gradient accumulation in the flax s2s pmap script.</li>
624
- <li md-src-pos="4503..4778">The dataset quality directly affects the resulting output, for pre-training, fine-tuning and also evaluation. Next efforts should favor spending time on dataset cleaning. (The perplexity measure that the Bertin project uses might be useful to filter the dataset on, to reduce training time.)</li>
625
- <li md-src-pos="4779..4929">Good Bleu score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be better suited for model comparison.</li>
626
- </ul>
627
- <h2 md-src-pos="29051..29070">Acknowledgements</h2>
628
- <p md-src-pos="29072..29450"><span md-src-pos="29072..29171">This project would not have been possible without compute generously provided by Google through the</span> <a target="_blank" href="https://sites.research.google/trc/" md-src-pos="29172..29228">TPU Research Cloud</a><span md-src-pos="29228..29229">.</span> <span md-src-pos="29230..29245">The HuggingFace</span> <span
629
- md-src-pos="29246..29248">&#x1F917;</span> <span md-src-pos="29249..29288">ecosystem was instrumental in all parts</span> <span md-src-pos="29289..29305">of the training.</span> <span md-src-pos="29306..29313">Weights</span> <span md-src-pos="29314..29315">&amp;</span> <span md-src-pos="29316..29379">Biases made it possible to keep track of many training sessions</span> <span md-src-pos="29380..29450">and orchestrate hyper-parameter sweeps with insightful visualizations.</span></p>
630
- <p md-src-pos="29452..29527"><span md-src-pos="29452..29462">Created by</span> <a target="_blank" href="https://www.linkedin.com/in/yeb-havinga-86530825/" md-src-pos="29463..29527">Yeb Havinga</a></p>
631
  </div>
632
  </body>
633
  </html>
 
13
  <body>
14
  <div md-src-pos="0..29528">
15
  <h1 md-src-pos="0..26">Pre-training Dutch <!-- doesn't work on HF spaces?? span class="emoji">🇳🇱 🇧🇪</span--> T5 models </h1>
16
+ <p>TL;DR: look below for <a href="#model-list">the list of pre-trained Dutch and Dutch+English models</a>.</p>
17
+
18
+ <p md-src-pos="28..495"><span md-src-pos="28..64">A few months ago, I was given access to Google's TPU Research Cloud (TRC). My goal was to train several Dutch and Dutch+English T5 models, limited to model sizes that can run on a single GPU.
19
+ T5 is a text-to-text (sequence-to-sequence) encoder/decoder architecture. Since it encodes all inputs and outputs as text, it can be fine-tuned on a wide range of tasks.</span></p>
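+ <p>As a small illustration of that interface (the prefix and sentences below are made up, not taken from the training code), a task such as translation is expressed entirely as text:</p>
+ <pre class="code-fence"><code># Text-to-text format: a task prefix on the input, plain text as the target.
+ task_input  = "translate English to Dutch: The weather is nice today."
+ task_target = "Het weer is vandaag mooi."
+ </code></pre>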
20
  <ul md-src-pos="497..2062">
21
  <li md-src-pos="497..751"><strong md-src-pos="499..624"><a target="_blank" href="https://arxiv.org/abs/1910.10683.pdf" md-src-pos="501..622">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</a></strong> by <em md-src-pos="628..750">Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu</em>.</li>
22
  <li md-src-pos="752..1482"><strong md-src-pos="754..859"><a target="_blank" href="https://arxiv.org/abs/2110.08207" md-src-pos="756..857">Multitask Prompted Training Enables Zero-Shot Task Generalization</a></strong> by <em md-src-pos="863..1481">Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush</em>.</li>
 
29
  <li md-src-pos="2203..2305"><a target="_blank" href="https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104" md-src-pos="2205..2305">https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104</a></li>
30
  <li md-src-pos="2306..2407"><a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks" md-src-pos="2308..2407">https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks</a></li>
31
  </ul>
32
+
33
+ <h2 md-src-pos="18893..18908">Pre-training</h2>
34
+ <h3 md-src-pos="18910..18925">mC4 dataset</h3>
35
+ <p>
36
+ A few weeks before the <a target="_blank" href="https://github.com/huggingface/transformers/tree/main/examples/research_projects/jax-projects#talks">Flax/JAX Community Week</a> started, the multilingual C4 (mC4) TensorFlow dataset was prepared and <a target="_blank" href="https://huggingface.co/datasets/allenai/c4">released</a> by AllenNLP. This dataset was created by the original T5 authors and is composed of text files in many languages. We cleaned Dutch mC4 with <a target="_blank" href="https://gitlab.com/yhavinga/c4nlpreproc">code adapted</a> from the C4 TensorFlow dataset, and used the resulting text files in the pre-training scripts. We also verified that Dutch C4 was deduplicated.</p>
37
+ <p>
38
+ To be able to easily reuse this dataset for more pre-training sessions with Huggingface's scripts, a Huggingface dataset was created: <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned" md-src-pos="19449..19522">mc4_nl_cleaned</a>. For Dutch and English training, a couple of additional configs were added to the generation script. These configs produce interleaved Dutch and English texts with a 1:1 ratio. For instance, the <a target="_blank" href="https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned/viewer/micro_en_nl/train" md-src-pos="19690..19792">micro_en_nl config</a> mixes Dutch with English samples.
39
+ The cleaned English C4 dataset is about 5 times larger (in compressed bytes) than the Dutch part. 1:1 interleaving with Dutch discards about 80% of English C4.
40
+ The full cleaned Dutch mC4 dataset is 151GB, and still is (June '22) the largest Dutch cleaned corpus currently available on the HF Hub.
41
+ </p>
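+ <p>As a sketch of how the dataset is meant to be reused (assuming the standard Huggingface <code>datasets</code> API), a config such as <code>micro_en_nl</code> can be loaded by name:</p>
+ <pre class="code-fence"><code>from datasets import load_dataset
+
+ # The micro_en_nl config interleaves Dutch and English samples 1:1;
+ # the available config names are listed on the dataset card.
+ ds = load_dataset("yhavinga/mc4_nl_cleaned", "micro_en_nl", split="train")
+ print(ds[0]["text"][:200])
+ </code></pre>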
42
+
43
+ <h3 md-src-pos="20163..20243">Unsupervised Training Objective</h3>
44
+ <p md-src-pos="2409..2753"><span md-src-pos="2409..2463">The Dutch and Dutch+English T5 models are pre-trained using the masked language modeling (MLM) objective.
45
+ During pre-training, 15% of the tokens are masked and each span of masked tokens is replaced by a sentinel token.</span>
46
+ </p>
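+ <p>A toy example of what this looks like (the sentence and masked spans are made up for illustration; the real script samples spans over tokenized text):</p>
+ <pre class="code-fence"><code># Original sentence seen during pre-training:
+ original = "De snelle bruine vos springt over de luie hond"
+
+ # Encoder input: each masked span is replaced by a sentinel token.
+ model_input = "De snelle &lt;extra_id_0&gt; springt over &lt;extra_id_1&gt; hond"
+
+ # Decoder target: the dropped spans, delimited by the same sentinels.
+ target = "&lt;extra_id_0&gt; bruine vos &lt;extra_id_1&gt; de luie &lt;extra_id_2&gt;"
+ </code></pre>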
47
+ <h3 md-src-pos="20163..20243">Why are some models trained for multiple epochs on a smaller config?</h3>
48
+ <p>When I was using an old version of the Flax mlm pretraining script, I noticed that the per-batch training speed seemed slower at the beginning of epochs when a larger dataset config was used. Also, on large configs, batch shuffling would fail with a TPU out-of-memory error. For these reasons, I started experimenting with training for more epochs on smaller configs.
49
+ </p>
50
+ <p><span md-src-pos="20616..20634">This should be ok.</span> <span md-src-pos="20635..20717">In the original T5 paper downstream performance was compared between training on 2</span><sup><span md-src-pos="20722..20724">35</span></sup> <span md-src-pos="20731..20749">tokens vs training</span> <span md-src-pos="20750..20784">multiple epochs on a smaller part.</span> <span md-src-pos="20785..20800">64 repeats of 2</span><sup><span md-src-pos="20805..20807">29</span></sup> <span md-src-pos="20814..20871">tokens did not result in degraded downstream performance.</span> <span md-src-pos="20872..20881">The model</span> <code md-src-pos="20882..20925">yhavinga/t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="20926..20943">is trained on the</span> <code md-src-pos="20944..20951">small</code> <span md-src-pos="20952..20973">config for 10 epochs.</span> </p>
51
+ <p><span>
52
+ In the end,</span> <span md-src-pos="20986..21001">a change to the</span> <a target="_blank" href="https://github.com/huggingface/transformers/blame/main/examples/flax/language-modeling/run_t5_mlm_flax.py" md-src-pos="21002..21130">pre-training script</a> <span md-src-pos="21131..21157">to perform batch shuffling</span> (<span md-src-pos="21159..21177">permuting an array</span>) <span md-src-pos="21179..21189">on the CPU</span> <span md-src-pos="21190..21250">instead of the accelerator device solved these problems,</span> <span md-src-pos="21251..21303">and larger configs could be used without issues.</span></p>
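+ <p>A minimal sketch of the idea behind that change (the function and argument names are mine, not the script's): the permutation over sample indices is computed with NumPy in host memory, and only the selected batch is sent to the device.</p>
+ <pre class="code-fence"><code>import numpy as np
+
+ def host_batch_permutation(seed: int, num_samples: int, batch_size: int) -> np.ndarray:
+     """Shuffle sample indices on the CPU and group them into batches."""
+     steps = num_samples // batch_size
+     perm = np.random.default_rng(seed).permutation(num_samples)   # stays in host RAM
+     return perm[: steps * batch_size].reshape(steps, batch_size)  # one row per batch
+ </code></pre>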
53
+ <h3 md-src-pos="21305..21338">Which optimizer and lr to use</h3>
54
+ <p md-src-pos="21340..22064"><span md-src-pos="21340..21346">During the </span> <span md-src-pos="21348..21428">Flax/JAX Community Week we quickly decided on using Adafactor with learning rate 5e-3.</span> <span md-src-pos="21429..21460">I was sure that with more time,</span> <span md-src-pos="21461..21493">a better setting could be found.</span> <span md-src-pos="21494..21535">After performing seven sweeps with Adafactor,</span> <span md-src-pos="21536..21565">AdamW and Distributed Shampoo</span> (<span md-src-pos="21567..21579">experimental</span> <span md-src-pos="21580..21609">PJIT version from Dall-E mini</span>)<span md-src-pos="21610..21611">,</span> <span md-src-pos="21612..21646">I gave up on finding better settings.</span> <span md-src-pos="21647..21705">The graph below shows the runs from all seven sweeps combined.</span> <span md-src-pos="21706..21731">Apologies for the legend:</span> <span md-src-pos="21732..21774">I cannot show the optimizer in it,</span> <span md-src-pos="21775..21843">because the initial version of the training script had the optimizer flag</span> <code md-src-pos="21844..21857">--adafactor</code> <span md-src-pos="21858..21860">as a</span> <span md-src-pos="21861..21869">boolean,</span> <span md-src-pos="21870..21928">which I later changed to a string with the optimizer name.</span> <span md-src-pos="21929..21986">All runs in the graph below that get the loss below 4 use</span> <strong md-src-pos="21987..22000">Adafactor</strong><span md-src-pos="22000..22001">.</span> <span md-src-pos="22002..22054">Peach-sweep-6 is dashed orange and has learning rate</span> <strong md-src-pos="22055..22063">5e-3</strong><span md-src-pos="22063..22064">.</span></p>
55
+ <p md-src-pos="22066..22129"><img src="adafactor_vs_adam_pretrain.png" alt="Adafactor vs Adam vs Shampoo" __idea-generated="true" md-src-pos="22066..22129" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/adafactor_vs_adam_pretrain.png"></p>
56
+ <p md-src-pos="22131..22458"><span md-src-pos="22131..22235">While there probably is a setting that will allow Adam and Shampoo to also converge fast below loss 4.0,</span> <span md-src-pos="22236..22248">I was unable</span> <span md-src-pos="22249..22260">to find it.</span> <span md-src-pos="22261..22322">In a recent tweet Lucas Nestler had more success with Shampoo</span> (<a target="_blank" href="https://twitter.com/_clashluke/status/1535994026876252160" md-src-pos="22324..22381">https://twitter.com/_clashluke/status/1535994026876252160</a>) <span md-src-pos="22383..22458">so maybe I need to revisit the attempt with the latest upstream code bases.</span></p>
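+ <p>For reference, the setting that all well-converging runs share is Adafactor with learning rate 5e-3. Expressed with optax (a sketch, not the exact arguments of the training script) this is simply:</p>
+ <pre class="code-fence"><code>import optax
+
+ # Adafactor with the learning rate that worked in the sweeps above;
+ # all other Adafactor arguments are left at the optax defaults.
+ optimizer = optax.adafactor(learning_rate=5e-3)
+ </code></pre>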
57
+ <h3 md-src-pos="22460..22508">Bfloat16 datatype and learning rate schedule</h3>
58
+ <p md-src-pos="22510..23055"><span md-src-pos="22510..22588">I had some additional options in the pre-training script that I wanted to use.</span> <span md-src-pos="22589..22623">An exponential decay learning rate</span> <span md-src-pos="22624..22684">schedule would allow me to pre-train for as long as desired,</span> <span md-src-pos="22685..22720">instead of a fixed number of steps.</span> <span md-src-pos="22721..22764">I was also keen to pre-train with bfloat16,</span> <span md-src-pos="22765..22808">for the reduced memory footprint and speed.</span> <span md-src-pos="22809..22821">This failed.</span> <span md-src-pos="22822..22901">The graph below shows different attempts with the legend showing the optimizer,</span> <span md-src-pos="22902..22908">dtype,</span> <span md-src-pos="22909..22923">learning rate,</span> <span md-src-pos="22924..22965">total batch size and lr-schedule to train</span> <a target="_blank" href="https://huggingface.co/yhavinga/t5-small-24L-dutch-english" md-src-pos="22966..23054">t5-small-24L-dutch-english</a><span md-src-pos="23054..23055">.</span></p>
59
+ <p md-src-pos="23057..23098"><img src="bfloat16_loss.png" alt="Bfloat16 vs Float32" __idea-generated="true" md-src-pos="23057..23098" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/bfloat16_loss.png"></p>
60
+ <p md-src-pos="23100..23378"><span md-src-pos="23100..23111">In the end,</span> <span md-src-pos="23112..23167">all models released on the hub are trained with Flax in</span> <code md-src-pos="23168..23177">float32</code><span md-src-pos="23177..23178">.</span> <span md-src-pos="23179..23193">For reference,</span> <span md-src-pos="23194..23227">I ran Stas Bekman's script for</span> <a target="_blank" href="https://github.com/stas00/ml-ways/blob/master/numbers/detect-model-pretrained-in-bf16-fp16-fp32.ipynb" md-src-pos="23228..23376">bf16, fp16 or fp32 model pretrain detection</a><span md-src-pos="23376..23377">.</span></p>
61
+ <pre class="code-fence" md-src-pos="23380..24514"><code md-src-pos="23380..24514">
62
+ <div class="code-fence-highlighter-copy-button" data-fence-content="ICAgICAgICAgICAgICAgICAgICAgICBuYW1lICAgICAgICAgICAgICAgICAgICAgICAgfCAgYWJzIG1pbiAgfCAgYWJzIG1heCAgCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLXwtLS0tLS0tLS0tLXwtLS0tLS0tLS0tLQp5aGF2aW5nYS90NS1iYXNlLWR1dGNoICAgICAgICAgICAgICAgICAgICAgICAgICAgICB8IDEuNzU3ZS0wOSB8IDYuNzkyZSswMQp5aGF2aW5nYS90NS12MS4xLWJhc2UtZHV0Y2gtdW5jYXNlZCAgICAgICAgICAgICAgICB8IDEuMjE4ZS0wOSB8IDYuNzA4ZSswMgp5aGF2aW5nYS90NS12MS4xLWJhc2UtZHV0Y2gtY2FzZWQgICAgICAgICAgICAgICAgICB8IDMuMDA5ZS0wOSB8IDguODIxZSswMgp5aGF2aW5nYS90NS12MS4xLWxhcmdlLWR1dGNoLWNhc2VkICAgICAgICAgICAgICAgICB8IDAuMDAwZSswMCB8IDUuMDUzZSswMwp5aGF2aW5nYS90NS12MV8xLWJhc2UtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICAgICB8IDUuMTQwZS0wOSB8IDMuMTExZSswMwp5aGF2aW5nYS90NS12MV8xLWJhc2UtZHV0Y2gtZW5nbGlzaC1jYXNlZC0xMDI0ICAgICB8IDkuMzU5ZS0xMCB8IDEuMzA4ZSswMgp5aGF2aW5nYS90NS1zbWFsbC0yNEwtZHV0Y2gtZW5nbGlzaCAgICAgICAgICAgICAgICB8IDEuNTc3ZS0wOSB8IDEuMjc2ZSswMgp5aGF2aW5nYS90NS14bC00TC1kdXRjaC1lbmdsaXNoLWNhc2VkICAgICAgICAgICAgICB8IDMuMjM0ZS0xMSB8IDMuOTg2ZSswMQp5aGF2aW5nYS90NS1iYXNlLTM2TC1kdXRjaC1lbmdsaXNoLWNhc2VkICAgICAgICAgICB8IDIuNDA5ZS0xMCB8IDYuMTA0ZSswMQp5aGF2aW5nYS90NS1lZmYteGwtOGwtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICAgICB8IDUuNTMwZS0xMCB8IDguOTEyZSswMgp5aGF2aW5nYS90NS1lZmYtbGFyZ2UtOGwtZHV0Y2gtZW5nbGlzaC1jYXNlZCAgICAgICB8IDEuMDg2ZS0xMCB8IDUuMTI4ZSswMgp5aGF2aW5nYS90NS1iYXNlLTM2TC1jY21hdHJpeC1tdWx0aSAgICAgICAgICAgICAgICB8IDEuNzE1ZS0xMSB8IDMuNzQ2ZSswMQp5aGF2aW5nYS90NS1zbWFsbC0yNEwtY2NtYXRyaXgtbXVsdGkgICAgICAgICAgICAgICB8IDcuMDg2ZS0xMCB8IDEuMDUzZSswMgo=">
63
+
64
+ <img class="code-fence-highlighter-copy-button-icon">
65
+
66
+ </div><span md-src-pos="23380..23384"></span><span md-src-pos="23384..23459"> name | abs min | abs max </span>
67
+ <span md-src-pos="23460..23535">---------------------------------------------------|-----------|-----------</span>
68
+ <span md-src-pos="23536..23610">yhavinga/t5-base-dutch | 1.757e-09 | 6.792e+01</span>
69
+ <span md-src-pos="23611..23685">yhavinga/t5-v1.1-base-dutch-uncased | 1.218e-09 | 6.708e+02</span>
70
+ <span md-src-pos="23686..23760">yhavinga/t5-v1.1-base-dutch-cased | 3.009e-09 | 8.821e+02</span>
71
+ <span md-src-pos="23761..23835">yhavinga/t5-v1.1-large-dutch-cased | 0.000e+00 | 5.053e+03</span>
72
+ <span md-src-pos="23836..23910">yhavinga/t5-v1_1-base-dutch-english-cased | 5.140e-09 | 3.111e+03</span>
73
+ <span md-src-pos="23911..23985">yhavinga/t5-v1_1-base-dutch-english-cased-1024 | 9.359e-10 | 1.308e+02</span>
74
+ <span md-src-pos="23986..24060">yhavinga/t5-small-24L-dutch-english | 1.577e-09 | 1.276e+02</span>
75
+ <span md-src-pos="24061..24135">yhavinga/t5-xl-4L-dutch-english-cased | 3.234e-11 | 3.986e+01</span>
76
+ <span md-src-pos="24136..24210">yhavinga/t5-base-36L-dutch-english-cased | 2.409e-10 | 6.104e+01</span>
77
+ <span md-src-pos="24211..24285">yhavinga/t5-eff-xl-8l-dutch-english-cased | 5.530e-10 | 8.912e+02</span>
78
+ <span md-src-pos="24286..24360">yhavinga/t5-eff-large-8l-dutch-english-cased | 1.086e-10 | 5.128e+02</span>
79
+ <span md-src-pos="24361..24435">yhavinga/t5-base-36L-ccmatrix-multi | 1.715e-11 | 3.746e+01</span>
80
+ <span md-src-pos="24436..24510">yhavinga/t5-small-24L-ccmatrix-multi | 7.086e-10 | 1.053e+02</span>
81
+ <span md-src-pos="24511..24511"></span><span md-src-pos="24511..24514"></span></code></pre>
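+ <p>Coming back to the learning rate schedule mentioned above: an exponential-decay schedule can be expressed with optax as follows (the numbers are placeholders, not the values used for the released models):</p>
+ <pre class="code-fence"><code>import optax
+
+ # Decay the learning rate by a constant factor every transition_steps steps,
+ # so pre-training can simply run for as long as desired.
+ lr_schedule = optax.exponential_decay(
+     init_value=5e-3,         # placeholder starting learning rate
+     transition_steps=10_000,
+     decay_rate=0.8,
+ )
+ </code></pre>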
82
+ <h2>Fine-tuning</h2>
83
+ <h3 md-src-pos="24516..24554">Training t5-base-36L-dutch-english</h3>
84
+ <p md-src-pos="24556..24958"><span md-src-pos="24556..24668">The following image shows the loss curves of the sessions in which I was trying to find the right combination of</span> <span md-src-pos="24669..24685">total batch size</span> (<span md-src-pos="24687..24721">by adjusting gradient accumulation</span>)<span md-src-pos="24722..24723">,</span> <span md-src-pos="24724..24751">learning rate and datatype.</span> <span md-src-pos="24752..24766">Unfortunately,</span> <span md-src-pos="24767..24784">again I could not</span> <span md-src-pos="24785..24818">find a good setting for bfloat16.</span> <span md-src-pos="24819..24867">The three green runs are the ones that end up in</span> <code md-src-pos="24868..24895">t5-base-36L-dutch-english</code><span md-src-pos="24895..24896">.</span> <span md-src-pos="24897..24930">Numbers shown are learning rate,</span> <span md-src-pos="24931..24958">dtype and total batch size.</span></p>
85
+ <p md-src-pos="24960..25020"><img src="training_base_36l_losses.png" alt="t5 base 36L training losses" __idea-generated="true" md-src-pos="24960..25020" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/training_base_36l_losses.png"></p>
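+ <p>The total batch size in these sessions was adjusted through gradient accumulation. One way to express that (a sketch; the script may implement accumulation differently) is optax's MultiSteps wrapper:</p>
+ <pre class="code-fence"><code>import optax
+
+ # Accumulate gradients over 4 micro-batches before each optimizer update,
+ # making the effective batch size 4x the per-step batch size.
+ base_optimizer = optax.adafactor(learning_rate=5e-3)
+ optimizer = optax.MultiSteps(base_optimizer, every_k_schedule=4)
+ </code></pre>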
86
+ <h2 md-src-pos="25022..25035">Evaluation</h2>
87
+ <h3 md-src-pos="25037..25086">Optimizer and learning rate for summarization</h3>
88
+ <p md-src-pos="25088..25365"><span md-src-pos="25088..25195">Finetuning summarization requires more memory than translation due to the longer sequence lengths involved.</span> <span md-src-pos="25196..25255">I wondered if I could use Adafactor instead of Adam and ran</span> <span md-src-pos="25256..25277">a sweep to test this.</span> <span md-src-pos="25278..25318">The sweep was configured with Hyperband,</span> <span md-src-pos="25319..25365">so not all training runs completed to the end.</span></p>
89
+ <p md-src-pos="25367..25439"><img src="optim_lr_summarization.png" alt="Optimizer Learning rate for summarization" __idea-generated="true" md-src-pos="25367..25439" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/optim_lr_summarization.png"></p>
90
+ <p md-src-pos="25441..25479"><span md-src-pos="25441..25478">The training losses are graphed below</span>:</p>
91
+ <p md-src-pos="25481..25564"><img src="training_losses_summarization_sweep.png" alt="Training losses for summarization sweep" __idea-generated="true" md-src-pos="25481..25564" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/training_losses_summarization_sweep.png"></p>
92
+ <p md-src-pos="25566..25921"><span md-src-pos="25566..25642">While the Adafactor run with learning rate 7e-4 came close to the Adam runs,</span> <span md-src-pos="25643..25689">the consistent stability of training with Adam</span> <span md-src-pos="25690..25769">made me stick with Adam as optimizer for evaluation runs on the several models.</span> <span md-src-pos="25770..25811">For translation the results were similar,</span> <span md-src-pos="25812..25881">though in the end I needed to configure a lower learning rate for all</span> <span md-src-pos="25882..25920">models to converge during fine-tuning.</span></p>
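+ <p>The sweeps were orchestrated with Weights &amp; Biases. A sketch of a sweep configuration with Hyperband early termination (the parameter values and project name are illustrative, not the exact sweep definition used):</p>
+ <pre class="code-fence"><code>import wandb
+
+ sweep_config = {
+     "method": "random",
+     "metric": {"name": "eval_loss", "goal": "minimize"},
+     "parameters": {
+         "optimizer": {"values": ["adam", "adafactor"]},
+         "learning_rate": {"values": [7e-4, 1e-3, 2e-3, 5e-3]},
+     },
+     # Hyperband stops unpromising runs early, so not all runs train to the end.
+     "early_terminate": {"type": "hyperband", "min_iter": 3},
+ }
+ sweep_id = wandb.sweep(sweep_config, project="t5-dutch-finetune")
+ </code></pre>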
93
+ <h3 md-src-pos="25923..25950">Running evaluation runs</h3>
94
+ <p md-src-pos="25952..26355"><span md-src-pos="25952..26033">The original T5 paper evaluated its models by fine-tuning on downstream tasks with a constant learning rate of 0.001.</span> <span md-src-pos="26034..26068">According to the sweep, 0.001 also</span> <span md-src-pos="26069..26123">works nicely with the Adam optimizer for summarization.</span> <span md-src-pos="26124..26185">A single model evaluation consisted of fine-tuning the model,</span> <span md-src-pos="26186..26260">followed by running predictions and metrics calculation on the test split.</span> <span md-src-pos="26261..26301">Fine-tuning for evaluation was done on a</span> <span md-src-pos="26302..26355">limited set of examples from the fine-tuning datasets.</span></p>
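+ <p>A sketch of the metrics step (assuming the Huggingface <code>evaluate</code> library; the toy prediction and reference below are made up):</p>
+ <pre class="code-fence"><code>import evaluate
+
+ predictions = ["de kat zat op de mat"]
+ references  = ["de kat zat op de mat"]
+
+ # Rouge1 is reported for summarization, Bleu (sacrebleu) for translation.
+ rouge = evaluate.load("rouge")
+ bleu  = evaluate.load("sacrebleu")
+ print(rouge.compute(predictions=predictions, references=references)["rouge1"])
+ print(bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])
+ </code></pre>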
95
+ <table md-src-pos="26357..26869">
96
+ <thead>
97
+ <tr md-src-pos="26357..26413">
98
+ <th align="right" md-src-pos="26358..26373"></th>
99
+ <th md-src-pos="26374..26392">Summarization</th>
100
+ <th md-src-pos="26393..26412">Translation</th>
101
+ </tr>
102
+ </thead>
103
+ <tbody>
104
+ <tr md-src-pos="26471..26527">
105
+ <td align="right" md-src-pos="26472..26487">Dataset</td>
106
+ <td md-src-pos="26488..26506">CNN Dailymail NL</td>
107
+ <td md-src-pos="26507..26526">CCMatrix en -&gt; nl</td>
108
+ </tr>
109
+ <tr class="intellij-row-even" md-src-pos="26528..26584">
110
+ <td align="right" md-src-pos="26529..26544">#Samples</td>
111
+ <td md-src-pos="26545..26563">50K</td>
112
+ <td md-src-pos="26564..26583">50K</td>
113
+ </tr>
114
+ <tr md-src-pos="26585..26641">
115
+ <td align="right" md-src-pos="26586..26601">Optimizer</td>
116
+ <td md-src-pos="26602..26620">Adam</td>
117
+ <td md-src-pos="26621..26640">Adam</td>
118
+ </tr>
119
+ <tr class="intellij-row-even" md-src-pos="26642..26698">
120
+ <td align="right" md-src-pos="26643..26658">learning rate</td>
121
+ <td md-src-pos="26659..26677">0.001</td>
122
+ <td md-src-pos="26678..26697">0.0005</td>
123
+ </tr>
124
+ <tr md-src-pos="26699..26755">
125
+ <td align="right" md-src-pos="26700..26715">source length</td>
126
+ <td md-src-pos="26716..26734">1024</td>
127
+ <td md-src-pos="26735..26754">128</td>
128
+ </tr>
129
+ <tr class="intellij-row-even" md-src-pos="26756..26812">
130
+ <td align="right" md-src-pos="26757..26772">target length</td>
131
+ <td md-src-pos="26773..26791">142</td>
132
+ <td md-src-pos="26792..26811">128</td>
133
+ </tr>
134
+ <tr md-src-pos="26813..26869">
135
+ <td align="right" md-src-pos="26814..26829">#eval samples</td>
136
+ <td md-src-pos="26830..26848">1000</td>
137
+ <td md-src-pos="26849..26868">1000</td>
138
+ </tr>
139
+ </tbody>
140
+ </table>
141
+ <p md-src-pos="26872..26943"><span md-src-pos="26872..26942">The graph below shows the train loss curves for the summarization runs</span>:</p>
142
+ <p md-src-pos="26945..27021"><img src="train_loss_eval_summarization.png" alt="Train loss evaluation T5 summarization" __idea-generated="true" md-src-pos="26945..27021" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/train_loss_eval_summarization.png"></p>
143
+ <p md-src-pos="27023..27092"><span md-src-pos="27023..27091">The graph below shows the train loss curves for the translation runs</span>:</p>
144
+ <p md-src-pos="27094..27169"><img src="train_loss_eval_t5_translation.png" alt="Train loss evaluation T5 translation" __idea-generated="true" md-src-pos="27094..27169" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/train_loss_eval_t5_translation.png"></p>
145
+ <p md-src-pos="27171..27494"><span md-src-pos="27171..27216">The figure below shows the evaluation scores,</span> <span md-src-pos="27217..27266">where the x-axis shows the translation Bleu score</span> (<span md-src-pos="27268..27284">higher is better</span>) <span md-src-pos="27286..27339">and y-axis the summarization Rouge1 translation score</span> (<span md-src-pos="27341..27357">higher is better</span>)<span md-src-pos="27358..27359">.</span> <span md-src-pos="27360..27405">Point size is proportional to the model size.</span> <span md-src-pos="27406..27451">Models with faster inference speed are green,</span> <span md-src-pos="27452..27477">slower inference speed is</span> <span md-src-pos="27478..27494">plotted as blue.</span></p>
146
+ <p md-src-pos="27496..27559"><img src="evaluation_t5_dutch_english.png" alt="Evaluation T5 Dutch English" __idea-generated="true" md-src-pos="27496..27559" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/evaluation_t5_dutch_english.png"></p>
147
+ <p md-src-pos="27561..28104"><span md-src-pos="27561..27593">While it is clear that the model</span> <code md-src-pos="27594..27627">t5-base-36L-dutch-english-cased</code> (<span md-src-pos="27629..27649">with 729M parameters</span>) <span md-src-pos="27651..27671">has the best scores,</span> <span md-src-pos="27672..27679">it also</span> <span md-src-pos="27680..27705">among the slowest models.</span> <span md-src-pos="27706..27715">The model</span> <code md-src-pos="27716..27753">t5-eff-large-8l-dutch-english-cased</code> (<span md-src-pos="27755..27775">with 335M parameters</span>) <span md-src-pos="27777..27796">has the second best</span> <span md-src-pos="27797..27841">training loss after 390 steps in both tasks,</span> <span md-src-pos="27842..27878">but with a 4 times faster inference.</span> <span md-src-pos="27879..27915">Surprizing is the difference between</span> <code md-src-pos="27916..27950">t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="27951..27954">and</span> <code md-src-pos="27955..27994">t5-v1_1-base-dutch-english-cased-1024</code><span md-src-pos="27994..27995">,</span> <span md-src-pos="27996..28035">most notable on the summarization task.</span> <span md-src-pos="28036..28103">This might be due to the difference in pre-training sequence length</span>:</p>
148
+ <h3 md-src-pos="28106..28137">Sequence length 512 or 1024</h3>
149
+ <p md-src-pos="28139..28593"><span md-src-pos="28139..28149">The models</span> <code md-src-pos="28150..28184">t5-v1_1-base-dutch-english-cased</code> <span md-src-pos="28185..28188">and</span> <code md-src-pos="28189..28228">t5-v1_1-base-dutch-english-cased-1024</code> <span md-src-pos="28229..28260">have the same model dimensions,</span> <span md-src-pos="28261..28311">but are pre-trained on different sequence lenghts,</span> <span md-src-pos="28312..28338">512 and 1024 respectively.</span> <span md-src-pos="28339..28412">The evaluation loss and accuracy of the models do not look too different.</span> <span md-src-pos="28413..28465">Since training of the 1024 sequence length model was</span> <span md-src-pos="28466..28484">very slow and didn</span>'<span md-src-pos="28485..28516">t converge a was was very slow,</span> <span md-src-pos="28517..28536">I stopped it early.</span> <span md-src-pos="28537..28574">The figure below shows the evaluation</span> <span md-src-pos="28575..28593">loss and accuracy.</span></p>
150
+ <p md-src-pos="28595..28685"><img src="t5v1_1eval_loss_and_accuracy.png" alt="T5 v11 base dutch english eval loss and accuracypng" __idea-generated="true" md-src-pos="28595..28685" data-original-src="file:/home/yeb/Developer/yhavinga/nedd_x/app/t5v1_1eval_loss_and_accuracy.png"></p>
151
+ <p md-src-pos="28687..29049"><span md-src-pos="28687..28749">The 512 sequence length model was trained for 10 epochs of the</span> <code md-src-pos="28750..28757">small</code> <span md-src-pos="28758..28770">nl+en config</span> (<span md-src-pos="28772..28789">186B tokens total</span>) <span md-src-pos="28791..28803">and the 1024</span> <span md-src-pos="28804..28847">sequence length model about 2 epochs of the</span> <code md-src-pos="28848..28855">large</code> <span md-src-pos="28856..28868">nl+en config</span> (<span md-src-pos="28870..28887">100B tokens total</span>)<span md-src-pos="28888..28889">.</span> <span md-src-pos="28890..28921">While I expected both models to</span> <span md-src-pos="28922..28960">perform similarly on downstream tasks,</span> <span md-src-pos="28961..29018">the 1024 sequence length model has better scores for both</span> <span md-src-pos="29019..29049">summarization and translation.</span></p>
152
+ <p md-src-pos="2755..2810"><span md-src-pos="2755..2798">Some final
153
+ notes:</p>
154
+ <ul md-src-pos="2812..4929">
155
+ <li md-src-pos="2812..2869">Note: The <code md-src-pos="2824..2834">t5-small</code> model with 24 layers is not small.</li>
156
+ <li md-src-pos="2870..3120">Training with more layers is much slower than you'd expect from the increased model size. It is also more difficult to get batch size and learning rate right.
157
+ See e.g. the section about finding the right hyperparameters for the base-36L training.</li>
158
+ <li md-src-pos="3121..3339">The 'larger' models are not only harder to pre-train, but also harder to fine-tune. The optimizer eats up a lot of space, and the amount of memory required also depends on the length of source and target sequences.</li>
159
+ <li md-src-pos="3340..3446">When iterating over models and running evaluation, a sqlite database can be used to scribble results on.</li>
160
+ <li md-src-pos="3447..3602">PyCharm. Remote debugging from your workstation to either a TPU VM or your deep-learning workstation gives very good insight into the data structures.</li>
161
+ <li md-src-pos="3603..3731">When increasing the batch size, increase the learning rate. bs * 2 -&gt; lr * sqrt(2) is a good heuristic but mileage may vary.</li>
162
+ <li md-src-pos="3732..3934">Dropout or not. It is a regularization technique, but also takes up memory. First try without dropout. If that doesn't work, try it with dropout. The smaller models can probably be trained without.</li>
163
+ <li md-src-pos="3935..4040">Training in <code md-src-pos="3949..3959">bfloat16</code> is hard to get right. If suspicious of a result, switch back to <code md-src-pos="4024..4033">float32</code> first.</li>
164
+ <li md-src-pos="4041..4218">Translation evaluation: the low score of the 128 seq len models on opus books may be because of the brevity penaly... that books may have sentences longer than 128 tokens.</li>
165
+ <li md-src-pos="4219..4354"><code md-src-pos="4221..4258">t5-eff-large-8l-dutch-english-cased</code> has good aptitude for the translation task and is fast - good candidate for serious fine-tuning</li>
166
+ <li md-src-pos="4355..4442"><code md-src-pos="4357..4387">t5-xl-4l-dutch-english-cased</code> is both slow and exhibits bad fine-tuning performance.</li>
167
+ <li md-src-pos="4443..4502">Gradient accumulation in the flax s2s pmap script would be nice.</li>
168
+ <li md-src-pos="4503..4778">The dataset directly results output, for pre-training, fine-tuning and also evaluation. Next efforts should favor spending time on dataset cleaning. (The perplexity measure that the Bertin project uses might be useful to filter the dataset on, to reduce training time.)</li>
169
+ <li md-src-pos="4779..4929">Good Bleu score does not necessarily mean fluent text. Evaluation loss on a large translation dataset might be better suited for model comparison.</li>
170
+ </ul>
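+ <p>A minimal sketch of the SQLite scribble database mentioned in the notes above, using only the Python standard library. The table layout and column names are just an example.</p>
+ <pre><code>
+ # Tiny results log in SQLite (standard library only); the schema is an
+ # example of what could be recorded per evaluation run.
+ import sqlite3
+ 
+ con = sqlite3.connect("eval_results.db")
+ con.execute("""
+     CREATE TABLE IF NOT EXISTS results (
+         model TEXT,
+         task TEXT,
+         learning_rate REAL,
+         metric TEXT,
+         value REAL,
+         created TEXT DEFAULT CURRENT_TIMESTAMP
+     )
+ """)
+ 
+ def log_result(model, task, learning_rate, metric, value):
+     con.execute(
+         "INSERT INTO results (model, task, learning_rate, metric, value) "
+         "VALUES (?, ?, ?, ?, ?)",
+         (model, task, learning_rate, metric, value),
+     )
+     con.commit()
+ 
+ # Dummy entry to show usage.
+ log_result("t5-base-dutch", "summarization", 1e-3, "rouge1", 0.0)
+ for row in con.execute("SELECT model, metric, value FROM results"):
+     print(row)
+ </code></pre>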
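+ <p>The batch size / learning rate heuristic from the notes, written out as a small helper. It is only a rule of thumb, not a guarantee.</p>
+ <pre><code>
+ # Square-root scaling: doubling the batch size multiplies the learning
+ # rate by sqrt(2). A heuristic only; mileage may vary.
+ import math
+ 
+ def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
+     return base_lr * math.sqrt(new_batch_size / base_batch_size)
+ 
+ # Example: going from batch size 32 at lr 5e-4 to batch size 64.
+ print(scale_learning_rate(5e-4, 32, 64))   # approximately 7.07e-4
+ </code></pre>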
171
+ <h2 md-src-pos="29051..29070">Acknowledgements</h2>
172
+ <p md-src-pos="29072..29450"><span md-src-pos="29072..29171">This project would not have been possible without compute generously provided by Google through the</span> <a target="_blank" href="https://sites.research.google/trc/" md-src-pos="29172..29228">TPU Research Cloud</a><span md-src-pos="29228..29229">.</span> <span md-src-pos="29230..29245">The HuggingFace</span> <span
173
+ md-src-pos="29246..29248">&#x1F917</span> <span md-src-pos="29249..29288">ecosystem was instrumental in all parts</span> <span md-src-pos="29289..29305">of the training.</span> <span md-src-pos="29306..29313">Weights</span> <span md-src-pos="29314..29315">&amp;</span> <span md-src-pos="29316..29379">Biases made it possible to keep track of many training sessions</span> <span md-src-pos="29380..29450">and orchestrate hyper-parameter sweeps with insightful visualizations.</span></p>
174
+ <p md-src-pos="29452..29527"><span md-src-pos="29452..29462">Created by</span> <a target="_blank" href="https://www.linkedin.com/in/yeb-havinga-86530825/" md-src-pos="29463..29527">Yeb Havinga</a></p>
175
+
176
+
177
+ <a id="model-list"><h2 md-src-pos="4931..4979">Pre-trained Dutch and Dutch+English T5 models</h2></a>
178
  <p md-src-pos="4981..5522"><span md-src-pos="4981..5024">Three types of T5 models have been trained.</span> <code md-src-pos="5025..5040">t5-base-dutch</code> <span md-src-pos="5041..5086">is the only model with an original T5 config.</span> <span md-src-pos="5087..5132">The other model types t5-v1.1 and t5-eff have</span> <code md-src-pos="5133..5145">gated-relu</code> <span md-src-pos="5146..5156">instead of</span> <code md-src-pos="5157..5163">relu</code> <span md-src-pos="5164..5187">as activation function,</span> <span md-src-pos="5188..5218">and trained with a drop-out of</span> <code md-src-pos="5219..5224">0.0</code> <span md-src-pos="5225..5254">unless training would diverge</span> (<code md-src-pos="5256..5283">t5-v1.1-large-dutch-cased</code>)<span md-src-pos="5284..5285">.</span> <span md-src-pos="5286..5353">The T5-eff models are models that differ in their number of layers.</span> <span md-src-pos="5354..5373">The table will list</span> <span md-src-pos="5374..5413">the several dimensions of these models.</span> <span md-src-pos="5414..5450">Not all t5-eff models are efficient,</span> <span md-src-pos="5451..5489">the best example being the inefficient</span> <code md-src-pos="5490..5520">t5-xl-4L-dutch-english-cased</code><span md-src-pos="5520..5521">.</span></p>
179
  <table md-src-pos="5524..14583">
180
  <thead>
181
+ <tr md-src-pos="5524..6614">
182
  <tr md-src-pos="5524..6614">
183
  <th md-src-pos="5525..5544"></th>
184
  <th md-src-pos="5545..5611"><a target="_blank" href="https://huggingface.co/yhavinga/t5-base-dutch" md-src-pos="5546..5608">t5-base-dutch</a></th>
 
646
  </tr>
647
  </tbody>
648
  </table>
 
649
 
650
+
651
  </div>
652
  </body>
653
  </html>