Networks
Networks are combinations of tf.keras layers (and possibly other networks). They are tf.keras models that would not be trained alone. Each network encapsulates a common network structure, such as a transformer encoder, in an easily handled object with a standardized configuration.
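As a concrete illustration, here is a minimal sketch of building such a network and handling it as an ordinary Keras model. It assumes the tensorflow-models-official package (imported as tensorflow_models); the tiny hyperparameter values are placeholders, and constructor argument names can vary slightly between releases.

```python
import tensorflow_models as tfm  # assumes the tensorflow-models-official package

# Build a small BERT-style encoder. Every architectural choice is a constructor
# argument, which is what "standardized configuration" means here; the values
# below are deliberately tiny placeholders, not recommended settings.
bert_encoder = tfm.nlp.networks.BertEncoder(
    vocab_size=30522,        # WordPiece vocabulary size
    hidden_size=64,          # BERT-base uses 768
    num_layers=2,            # BERT-base uses 12
    num_attention_heads=2,
    max_sequence_length=128,
    type_vocab_size=2,
)

# The network is an ordinary tf.keras model, so the usual Keras tooling applies.
bert_encoder.summary()
config = bert_encoder.get_config()                         # standardized configuration
rebuilt = tfm.nlp.networks.BertEncoder.from_config(config)
```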
* `BertEncoder` implements a bi-directional Transformer-based encoder as described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". It includes the embedding lookups, transformer layers and pooling layer. A forward-pass sketch appears after this list.
* `AlbertEncoder` implements a Transformer encoder as described in the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". Compared with BERT, ALBERT factorizes the embedding parameters into two smaller matrices and shares parameters across layers.
* `MobileBERTEncoder` implements the MobileBERT network described in the paper "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices".
* `Classification` contains a single hidden layer and is intended for use as a classification or regression head (regression if the number of classes is set to 1). A usage sketch, together with `SpanLabeling`, appears after this list.
* `PackedSequenceEmbedding` implements an embedding network that supports packed sequences and position ids.
* `SpanLabeling` implements a single-span labeler (that is, a prediction head that can predict one start and one end index per batch item) based on a single dense hidden layer. It can be used in the SQuAD task.
* `XLNetBase` implements the base network used in "XLNet: Generalized Autoregressive Pretraining for Language Understanding" (https://arxiv.org/abs/1906.08237). It includes embedding lookups, relative position encodings, mask computations, segment matrix computations and Transformer-XL layers using one- or two-stream relative self-attention.
* `FNet` implements the encoder model from "FNet: Mixing Tokens with Fourier Transforms". FNet has the same structure as a Transformer encoder, except that all or most of the self-attention sublayers are replaced with Fourier sublayers.
* `SparseMixer` implements the encoder model from "Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT". Sparse Mixer consists of layers of heterogeneous encoder blocks. Each encoder block contains a linear mixing or an attention sublayer together with a dense MLP or a sparsely activated Mixture-of-Experts sublayer.
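As a rough usage sketch, the snippet below runs a batch of dummy token ids through a small `BertEncoder`. The input and output names shown reflect recent releases of tensorflow-models-official (older releases return a two-element list rather than a dict), so treat them as assumptions to check against the installed version.

```python
import numpy as np
import tensorflow_models as tfm  # assumes the tensorflow-models-official package

batch_size, seq_length, vocab_size = 2, 16, 30522
encoder = tfm.nlp.networks.BertEncoder(
    vocab_size=vocab_size, hidden_size=64, num_layers=2, num_attention_heads=2)

# Dummy inputs: token ids, a padding mask (1 = real token), and segment ids.
inputs = dict(
    input_word_ids=np.random.randint(
        vocab_size, size=(batch_size, seq_length)).astype("int32"),
    input_mask=np.ones((batch_size, seq_length), dtype=np.int32),
    input_type_ids=np.zeros((batch_size, seq_length), dtype=np.int32),
)

outputs = encoder(inputs)
sequence_output = outputs["sequence_output"]  # [batch, seq_length, hidden_size]
pooled_output = outputs["pooled_output"]      # [batch, hidden_size], from the pooling layer
```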
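In the same spirit, here is a hedged sketch of the `Classification` and `SpanLabeling` heads. The constructor arguments shown (`input_width`, `num_classes`) and the output structure are assumptions based on the versions of these networks the author of this sketch has seen and should be verified against the installed release; the random arrays stand in for encoder outputs.

```python
import numpy as np
import tensorflow_models as tfm  # assumes the tensorflow-models-official package

batch_size, seq_length, hidden_size = 2, 16, 64

# Stand-ins for encoder outputs (normally produced by e.g. BertEncoder).
pooled_output = np.random.rand(batch_size, hidden_size).astype("float32")
sequence_output = np.random.rand(batch_size, seq_length, hidden_size).astype("float32")

# Classification head: a single hidden layer over the pooled representation.
# With num_classes=1 it behaves as a regression head, as described above.
classifier_head = tfm.nlp.networks.Classification(
    input_width=hidden_size, num_classes=3)
class_logits = classifier_head(pooled_output)   # shape [batch_size, num_classes]

# SpanLabeling head: a single dense hidden layer producing start/end predictions,
# e.g. for SQuAD-style extractive question answering.
span_head = tfm.nlp.networks.SpanLabeling(input_width=hidden_size)
span_outputs = span_head(sequence_output)       # start and end logits over token positions
```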