---
license: mit
---
# Moving Average Gated Attention (Mega): Pretrained LM
This repo contains pretrained weights for a masked language model built on the Mega architecture (see the paper, [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2210.10487)).
I used the original Mega source code (namely the `MegaEncoderLayer` class) and wrote wrappers around it for token embeddings and MLM prediction. The model was pretrained for 5 epochs (11.3k gradient steps) on WikiText-103, which took roughly 5 hours on a single T4 GPU (Colab's free tier).
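The wrappers can be pictured roughly as follows. This is a minimal sketch rather than the code from the notebook: the class name `MegaForMaskedLM`, the constructor arguments, and the assumed `MegaEncoderLayer` interface (a fairseq-style `forward(x, padding_mask)` with `x` shaped `(seq_len, batch, dim)`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from typing import Optional


class MegaForMaskedLM(nn.Module):
    """Token embeddings + a stack of Mega encoder layers + an MLM prediction head."""

    def __init__(self, mega_layers: nn.ModuleList, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.layers = mega_layers  # e.g. an nn.ModuleList of MegaEncoderLayer instances
        self.norm = nn.LayerNorm(hidden_dim)
        # MLM head maps hidden states back to the vocabulary; weights tied to the embeddings
        self.mlm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.mlm_head.weight = self.embed.weight

    def forward(self, input_ids: torch.Tensor,
                padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # (batch, seq_len) token ids -> (seq_len, batch, dim), the fairseq layout
        x = self.embed(input_ids).transpose(0, 1)
        for layer in self.layers:
            x = layer(x, padding_mask)  # assumed MegaEncoderLayer signature
        x = self.norm(x).transpose(0, 1)  # back to (batch, seq_len, dim)
        return self.mlm_head(x)           # per-position vocabulary logits
```

Tying the MLM head to the input embedding matrix is shown here as a common choice that keeps the parameter count down for a small pretraining run; the actual wrappers may differ, and the full implementation is in the Colab notebook referenced below.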
See the Colab notebook for further training details and example code for reuse.