# Layers

Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.

*   [MultiHeadAttention](attention.py) implements an optionally masked attention
    between query, key, value tensors as described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If
    `from_tensor` and `to_tensor` are the same, then this is self-attention.
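
    A minimal self-attention sketch, assuming the layer follows the Keras-style
    `query`/`value` call convention (shapes are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)
    x = tf.random.normal([2, 16, 512])    # [batch, seq_length, hidden]
    # Passing the same tensor as query and value yields self-attention.
    output = attention(query=x, value=x)  # [2, 16, 512]
    ```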

*   [BigBirdAttention](bigbird_attention.py) implements a sparse attention
    mechanism that reduces the quadratic dependency of full attention on
    sequence length to a linear one, as described in
    ["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).

*   [CachedAttention](attention.py) implements an attention layer with a cache
    used for auto-regressive decoding.

*   [KernelAttention](kernel_attention.py) implements a family of attention
    mechanisms that express self-attention as a linear dot-product of kernel
    feature maps and exploit the associativity of matrix products to reduce the
    complexity from quadratic to linear. The implementation includes methods
    described in ["Transformers are RNNs: Fast Autoregressive Transformers with
    Linear Attention"](https://arxiv.org/abs/2006.16236),
    ["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794),
    and ["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).

*   [MatMulWithMargin](mat_mul_with_margin.py) implements a
    matrix-multiplication-with-margin layer used for training retrieval /
    ranking tasks, as described in ["Improving Multilingual Sentence Embedding
    using Bi-directional Dual Encoder with Additive Margin
    Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).

*   [MultiChannelAttention](multi_channel_attention.py) implements a variant of
    multi-head attention that can be used to merge multiple streams for
    cross-attention.

*   [TalkingHeadsAttention](talking_heads_attention.py) implements talking-heads
    attention, as described in
    ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).

*   [Transformer](transformer.py) implements an optionally masked transformer as
    described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).

*   [TransformerDecoderBlock](transformer.py) implements a single transformer
    decoder block, made up of multi-head self-attention, multi-head
    cross-attention, and a feedforward network.

*   [RandomFeatureGaussianProcess](gaussian_process.py) implements a
    random-feature-based Gaussian process as described in ["Random Features for
    Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).

*   [ReuseMultiHeadAttention](reuse_attention.py) supports passing in
    previously computed attention scores so they can be reused instead of
    recomputed, as described in
    ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).

*   [ReuseTransformer](reuse_transformer.py) supports reusing attention scores
    from lower layers in higher layers to avoid recomputing them, as described
    in ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).

*   [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
    ReZero, as described in
    ["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).

*   [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
    lookups designed for TPU-based models.
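
    A minimal lookup sketch (vocabulary size, embedding width, and ids are
    illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    embedding = layers.OnDeviceEmbedding(vocab_size=30522, embedding_width=768)
    word_ids = tf.constant([[101, 2023, 2003, 102]])  # [batch, seq_length]
    word_embeddings = embedding(word_ids)             # [1, 4, 768]
    ```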

*   [PositionalEmbedding](position_embedding.py) creates a positional embedding
    as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
    Language Understanding"](https://arxiv.org/abs/1810.04805).

*   [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
    a 2D tensor mask.
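
    A small sketch, assuming the layer is called with the sequence embeddings
    and a 2D padding mask (in some versions the two tensors are instead passed
    as a single list):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    embeddings = tf.random.normal([2, 16, 768])      # [batch, seq_length, hidden]
    padding_mask = tf.ones([2, 16], dtype=tf.int32)  # 1 = real token, 0 = padding
    # Produces a [batch, seq_length, seq_length] mask for attention layers.
    attention_mask = layers.SelfAttentionMask()(embeddings, padding_mask)
    ```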

*   [SpectralNormalization](spectral_normalization.py) implements a Keras layer
    wrapper that applies spectral normalization regularization to a given
    layer. See ["Spectral Norm Regularization for Improving the
    Generalizability of Deep Learning"](https://arxiv.org/abs/1705.10941).
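
    A sketch of wrapping a dense layer, assuming the class is exported at the
    package level (the wrapped layer and shapes are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    # Wrap a layer so that its kernel is spectrally normalized during training.
    dense = layers.SpectralNormalization(tf.keras.layers.Dense(16))
    outputs = dense(tf.random.normal([8, 32]))  # [8, 16]
    ```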

*   [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
    masking input. If no mask is provided to this layer, it performs a standard
    softmax; however, if a mask tensor is applied (which should be 1 in
    positions where the data should be allowed through, and 0 where the data
    should be masked), the output will have masked positions set to
    approximately zero.

*   [`MaskedLM`](masked_lm.py) implements a masked language model. It assumes
    the embedding table variable is passed to it.
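
    A sketch of how the layer might be wired up, with an illustrative,
    externally owned embedding table:

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    vocab_size, hidden_size = 30522, 768
    embedding_table = tf.Variable(tf.random.normal([vocab_size, hidden_size]))
    masked_lm = layers.MaskedLM(embedding_table=embedding_table)

    sequence_output = tf.random.normal([2, 16, hidden_size])  # encoder outputs
    masked_positions = tf.constant([[1, 4], [2, 7]])          # positions to predict
    # Logits over the vocabulary for each masked position.
    logits = masked_lm(sequence_output, masked_positions)
    ```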

*   [ClassificationHead](cls_head.py) A pooling head over a sequence of
    embeddings, commonly used by classification tasks.
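
    A small sketch (the `inner_dim` and `num_classes` values are illustrative):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    cls_head = layers.ClassificationHead(inner_dim=768, num_classes=3)
    sequence_output = tf.random.normal([2, 16, 768])  # [batch, seq_length, hidden]
    logits = cls_head(sequence_output)                # [2, 3]
    ```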

*   [GaussianProcessClassificationHead](cls_head.py) A spectral-normalized
    neural Gaussian process (SNGP)-based classification head as described in
    ["Simple and Principled Uncertainty Estimation with Deterministic Deep
     Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).

*   [GatedFeedforward](gated_feedforward.py) implements a gated feedforward
    layer as described in
    ["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).

*   [MultiHeadRelativeAttention](relative_attention.py) implements a variant
    of multi-head attention with support for relative position encodings as
    described in ["Transformer-XL: Attentive Language Models Beyond a
    Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
    extended support for segment-based attention, a re-parameterization
    introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237).

*   [TwoStreamRelativeAttention](relative_attention.py) implements a variant of
    multi-head relative attention as described in ["XLNet: Generalized
    Autoregressive Pretraining for Language
    Understanding"](https://arxiv.org/abs/1906.08237). This takes in query and
    content streams and applies self-attention.

*   [TransformerXL](transformer_xl.py) implements Transformer-XL as introduced
    in ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length
    Context"](https://arxiv.org/abs/1901.02860). This module contains
    `TransformerXLBlock`, a block with either one- or two-stream relative
    self-attention followed by a feedforward network, and `TransformerXL`,
    which holds the attention biases and stacks multiple `TransformerXLBlock`s.

*   [MobileBertEmbedding](mobile_bert_layers.py) and
    [MobileBertTransformer](mobile_bert_layers.py) implement the embedding
    layer and the transformer layer proposed in the
    [MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).

*   [BertTokenizer](text_layers.py), [SentencepieceTokenizer](text_layers.py),
    and [BertPackInputs](text_layers.py) implement layers that tokenize raw
    text and pack it into the inputs expected by BERT models.
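
    A sketch of the typical tokenize-then-pack flow (the vocabulary path and
    sequence length are placeholders; `tensorflow_text` must be installed):

    ```python
    import tensorflow as tf
    from official.nlp.modeling import layers

    tokenizer = layers.BertTokenizer(vocab_file="/path/to/vocab.txt",
                                     lower_case=True)
    packer = layers.BertPackInputs(
        seq_length=128, special_tokens_dict=tokenizer.get_special_tokens_dict())

    sentences = tf.constant(["hello world", "transformers are neat"])
    token_ids = tokenizer(sentences)  # ragged token ids per sentence
    # Dict with input_word_ids, input_mask, and input_type_ids.
    bert_inputs = packer([token_ids])
    ```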
    
*   [TransformerEncoderBlock](transformer_encoder_block.py) implements
    an optionally masked transformer as described in
    ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).