# Layers
Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.
* [MultiHeadAttention](attention.py) implements optionally masked attention
between query, key, and value tensors as described in
["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If the query
and key/value tensors are the same, this is self-attention (see the
self-attention sketch after this list).
* [BigBirdAttention](bigbird_attention.py) implements a sparse attention
mechanism that reduces the quadratic dependency on sequence length to linear,
as described in
["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).
* [CachedAttention](attention.py) implements an attention layer with a cache
used for auto-regressive decoding.
* [KernelAttention](kernel_attention.py) implements a family of attention
mechanisms that express self-attention as a linear dot-product of kernel
feature maps and exploit the associativity of matrix products to reduce the
complexity from quadratic to linear. The implementation includes methods
described in ["Transformers are RNNs: Fast Autoregressive Transformers with
Linear Attention"](https://arxiv.org/abs/2006.16236),
["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794), and
["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).
* [MatMulWithMargin](mat_mul_with_margin.py) implements a matrix
multiplication with margin layer used for training retrieval / ranking
tasks, as described in ["Improving Multilingual Sentence Embedding using
Bi-directional Dual Encoder with Additive Margin
Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).
* [MultiChannelAttention](multi_channel_attention.py) implements a variant of
multi-head attention that can be used to merge multiple streams for
cross-attention.
* [TalkingHeadsAttention](talking_heads_attention.py) implements the talking
heads attention, as described in
["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).
* [Transformer](transformer.py) implements an optionally masked transformer as
described in
["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
* [TransformerDecoderBlock](transformer.py) is made up of multi-head
self-attention, multi-head cross-attention, and a feedforward network.
* [RandomFeatureGaussianProcess](gaussian_process.py) implements a
random-feature-based Gaussian process, as described in ["Random Features for
Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf)
(see the SNGP sketch after this list).
* [ReuseMultiHeadAttention](reuse_attention.py) supports passing in attention
scores to be reused, avoiding recomputation, as described in
["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReuseTransformer](reuse_transformer.py) supports reusing attention scores
from lower layers in higher layers to avoid recomputing them, as described in
["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
ReZero, as described in
["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).
* [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
lookups designed for TPU-based models.
* [PositionalEmbedding](position_embedding.py) creates a positional embedding
as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding"](https://arxiv.org/abs/1810.04805).
* [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
a 2D tensor mask.
* [SpectralNormalization](spectral_normalization.py) implements a
`tf.keras.layers.Wrapper` that applies spectral normalization regularization
to a given layer. See [Spectral Norm Regularization for Improving the
Generalizability of Deep Learning](https://arxiv.org/abs/1705.10941).
* [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
masking input. If no mask is provided to this layer, it performs a standard
softmax; however, if a mask tensor is applied (which should be 1 in
positions where the data should be allowed through, and 0 where the data
should be masked), the output will have masked positions set to
approximately zero.
* [MaskedLM](masked_lm.py) implements a masked language model. It assumes the
embedding table variable is passed to it (see the masked-LM sketch after this
list).
* [ClassificationHead](cls_head.py) implements a pooling head over a sequence
of embeddings, commonly used for classification tasks (see the
classification-head sketch after this list).
* [GaussianProcessClassificationHead](cls_head.py) implements a
spectral-normalized neural Gaussian process (SNGP) classification head, as
described in ["Simple and Principled Uncertainty Estimation with Deterministic
Deep Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).
* [GatedFeedforward](gated_feedforward.py) implements a feedforward layer with
gated linear units, as described in
["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).
* [MultiHeadRelativeAttention](relative_attention.py) implements a variant
of multi-head attention with support for relative position encodings as
described in ["Transformer-XL: Attentive Language Models Beyond a
Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
extended support for segment-based attention, a re-parameterization
introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
Understanding"](https://arxiv.org/abs/1906.08237).
* [TwoStreamRelativeAttention](relative_attention.py) implements a variant
of multi-head relative attention as described in ["XLNet: Generalized
Autoregressive Pretraining for Language Understanding"](https://arxiv.org/abs/1906.08237).
This takes in a query and a content stream and applies self-attention.
* [TransformerXL](transformer_xl.py) implements Transformer-XL, introduced in
["Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context"](https://arxiv.org/abs/1901.02860).
This contains `TransformerXLBlock`, a block with either one- or two-stream
relative self-attention followed by a feedforward network, as well as
`TransformerXL`, which holds the attention biases and stacks multiple
`TransformerXLBlock` layers.
* [MobileBertEmbedding](mobile_bert_layers.py) and
[MobileBertTransformer](mobile_bert_layers.py) implement the embedding layer
and the transformer layer proposed in the
[MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).
* [BertPackInputs](text_layers.py), [BertTokenizer](text_layers.py), and
[SentencepieceTokenizer](text_layers.py) implement layers that tokenize raw
text and pack it into the inputs for BERT models (see the preprocessing sketch
after this list).
* [TransformerEncoderBlock](transformer_encoder_block.py) implements
an optionally masked transformer encoder block as described in
["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (see the
encoder-block sketch after this list).