# Layers
Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.
* [MultiHeadAttention](attention.py) implements an optionally masked attention
  between query, key, and value tensors as described in
  ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If
  `from_tensor` and `to_tensor` are the same, then this is self-attention. A
  usage sketch appears after this list.
* [BigBirdAttention](bigbird_attention.py) implements a sparse attention
  mechanism that reduces the quadratic dependency on sequence length to
  linear, as described in
  ["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).
* [CachedAttention](attention.py) implements an attention layer with a cache
  used for auto-regressive decoding.
* [KernelAttention](kernel_attention.py) implements a family of attention
  mechanisms that express self-attention as a linear dot-product of kernel
  feature maps and use the associativity of matrix products to reduce the
  complexity from quadratic to linear. The implementation includes methods
  described in ["Transformers are RNNs: Fast Autoregressive Transformers with
  Linear Attention"](https://arxiv.org/abs/2006.16236),
  ["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794), and
  ["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).
* [MatMulWithMargin](mat_mul_with_margin.py) implements a matrix
  multiplication with margin layer used for training retrieval / ranking
  tasks, as described in ["Improving Multilingual Sentence Embedding using
  Bi-directional Dual Encoder with Additive Margin
  Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).
* [MultiChannelAttention](multi_channel_attention.py) implements a variant of
  multi-head attention that can be used to merge multiple streams for
  cross-attention.
* [TalkingHeadsAttention](talking_heads_attention.py) implements talking-heads
  attention, as described in
  ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).
* [Transformer](transformer.py) implements an optionally masked transformer as
  described in
  ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
* [TransformerDecoderBlock](transformer.py) is made up of multi-head
  self-attention, multi-head cross-attention, and a feedforward network.
* [RandomFeatureGaussianProcess](gaussian_process.py) implements a random
  feature-based Gaussian process as described in ["Random Features for
  Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).
* [ReuseMultiHeadAttention](reuse_attention.py) supports passing in attention
  scores to be reused, avoiding their recomputation, as described in
  ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReuseTransformer](reuse_transformer.py) supports reusing attention scores
  from lower layers in higher layers to avoid recomputing them, as described in
  ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
  ReZero residual connections, as described in
  ["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).
* [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
  lookups designed for TPU-based models. A usage sketch appears after this
  list.
* [PositionEmbedding](position_embedding.py) creates a learned positional
  embedding as described in ["BERT: Pre-training of Deep Bidirectional
  Transformers for Language Understanding"](https://arxiv.org/abs/1810.04805).
* [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
  a 2D tensor mask. A usage sketch appears after this list.
* [SpectralNormalization](spectral_normalization.py) implements a Keras layer
  wrapper that applies spectral normalization regularization to a given layer.
  See [Spectral Norm Regularization for Improving the Generalizability of
  Deep Learning](https://arxiv.org/abs/1705.10941).
* [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
  masking input. If no mask is provided to this layer, it performs a standard
  softmax; however, if a mask tensor is applied (which should be 1 in
  positions where the data should be allowed through, and 0 where the data
  should be masked), the output will have masked positions set to
  approximately zero.
* [`MaskedLM`](masked_lm.py) implements a masked language model. It assumes
  the embedding table variable is passed to it. A usage sketch appears after
  this list.
* [ClassificationHead](cls_head.py) implements a pooling head over a sequence
  of embeddings, commonly used for classification tasks. A usage sketch
  appears after this list.
* [GaussianProcessClassificationHead](cls_head.py) implements a
  spectral-normalized neural Gaussian process (SNGP)-based classification head
  as described in ["Simple and Principled Uncertainty Estimation with
  Deterministic Deep Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).
* [GatedFeedforward](gated_feedforward.py) implements the gated linear
  feedforward layer as described in
  ["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).
* [MultiHeadRelativeAttention](relative_attention.py) implements a variant
  of multi-head attention with support for relative position encodings as
  described in ["Transformer-XL: Attentive Language Models Beyond a
  Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
  extended support for segment-based attention, a re-parameterization
  introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
  Understanding"](https://arxiv.org/abs/1906.08237).
* [TwoStreamRelativeAttention](relative_attention.py) implements a variant
  of multi-head relative attention as described in
  ["XLNet: Generalized Autoregressive Pretraining for Language Understanding"](https://arxiv.org/abs/1906.08237).
  This takes in query and content streams and applies self-attention.
* [TransformerXL](transformer_xl.py) implements Transformer-XL as introduced
  in ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"](https://arxiv.org/abs/1901.02860).
  This file contains `TransformerXLBlock`, a block that applies either one- or
  two-stream relative self-attention followed by a feedforward network, and
  `TransformerXL`, which holds the attention biases and stacks multiple
  `TransformerXLBlock` layers.
* [MobileBertEmbedding](mobile_bert_layers.py) and
  [MobileBertTransformer](mobile_bert_layers.py) implement the embedding layer
  and the transformer layer proposed in the
  [MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).
* [BertPackInputs](text_layers.py),
  [BertTokenizer](text_layers.py), and [SentencepieceTokenizer](text_layers.py)
  implement layers that tokenize raw text and pack it into the inputs expected
  by BERT models. A usage sketch appears after this list.
* [TransformerEncoderBlock](transformer_encoder_block.py) implements
  an optionally masked transformer block as described in
  ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). A usage
  sketch appears after this list.