---
license: mit
---
# Moving Average Gated Attention (Mega): Pretrained LM
This repo contains pretrained weights for a language model built on the Mega architecture (see the paper).
I used the Mega source code (specifically the MegaEncoderLayer
class) and wrote wrappers around it for token embeddings and MLM prediction. The model
was pretrained for 5 epochs (11.3k gradient steps) on WikiText-103, which took roughly 5 hours on a single T4 GPU (in Colab's free tier).
See the Colab notebook for further training details. To load the pretrained weights, you'll need the Mega repo along with the example code at the end of the Colab notebook.
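The wrapper structure described above (token embeddings in, encoder layer stack, MLM head out) can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: it uses PyTorch's `nn.TransformerEncoderLayer` as a stand-in for `MegaEncoderLayer` (which lives in the Mega repo and has its own constructor arguments), and the class name, dimensions, and checkpoint filename are assumptions.

```python
import torch
import torch.nn as nn


class MLMWrapper(nn.Module):
    """Sketch of wrapping an encoder layer stack with token embeddings
    and an MLM prediction head. In the real model, the layers would be
    MegaEncoderLayer instances from the Mega repo; a standard
    TransformerEncoderLayer stands in here so the sketch runs with
    plain PyTorch."""

    def __init__(self, vocab_size=30522, d_model=128, n_layers=2):
        super().__init__()
        # Token embedding: maps token ids to d_model-dim vectors.
        self.embed = nn.Embedding(vocab_size, d_model)
        # Encoder stack (stand-in for MegaEncoderLayer modules).
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        # MLM head: projects hidden states back onto the vocabulary.
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = layer(x)
        # Returns (batch, seq_len, vocab_size) logits for MLM.
        return self.mlm_head(x)


model = MLMWrapper()
logits = model(torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 30522])
```

With a wrapper like this defined to match the checkpoint's module names, loading the pretrained weights would follow the usual PyTorch pattern, e.g. `model.load_state_dict(torch.load("mega_mlm.pt"))` (filename hypothetical); the exact steps are in the example code at the end of the Colab notebook.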