---
license: mit
---
# Moving Average Gated Attention (Mega): Pretrained LM
This repo contains pretrained weights for a language model built on the Mega architecture (see the paper).
I used the Mega source code (specifically the MegaEncoderLayer
class) and wrote wrappers around it for token embeddings and MLM prediction. The model
was pretrained for 5 epochs (11.3k gradient steps) on WikiText-103, which took roughly 5 hours on a single T4 GPU (in Colab's free tier).
See the Colab notebook for further training details. To load the pretrained weights, you'll need the Mega repo along with the example code at the end of the Colab notebook.
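The wrapper structure described above (token embeddings in, encoder layer stack, MLM head out) can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: it uses PyTorch's `nn.TransformerEncoderLayer` as a stand-in for `MegaEncoderLayer` (which lives in the Mega repo and has its own constructor arguments), and the class name, dimensions, and checkpoint filename are assumptions.

```python
import torch
import torch.nn as nn


class MLMWrapper(nn.Module):
    """Sketch of wrapping an encoder layer stack with token embeddings
    and an MLM prediction head. In the real model, the layers would be
    MegaEncoderLayer instances from the Mega repo; a standard
    TransformerEncoderLayer stands in here so the sketch runs with
    plain PyTorch."""

    def __init__(self, vocab_size=30522, d_model=128, n_layers=2):
        super().__init__()
        # Token embedding: maps token ids to d_model-dim vectors.
        self.embed = nn.Embedding(vocab_size, d_model)
        # Encoder stack (stand-in for MegaEncoderLayer modules).
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        # MLM head: projects hidden states back onto the vocabulary.
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = layer(x)
        # Returns (batch, seq_len, vocab_size) logits for MLM.
        return self.mlm_head(x)


model = MLMWrapper()
logits = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 30522])
```

With a wrapper like this defined to match the checkpoint's module names, loading the pretrained weights would follow the usual PyTorch pattern, e.g. `model.load_state_dict(torch.load("mega_mlm.pt"))` (filename hypothetical); the exact steps are in the example code at the end of the Colab notebook.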