# Layers
Layers are the fundamental building blocks for NLP models. They can be used to
assemble new `tf.keras` layers or models.
* [MultiHeadAttention](attention.py) implements optionally masked attention
  between query, key, and value tensors as described in
  ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). If
  `from_tensor` and `to_tensor` are the same, then this is self-attention.
* [BigBirdAttention](bigbird_attention.py) implements a sparse attention
  mechanism that reduces the quadratic dependency of full attention to linear,
  as described in
  ["Big Bird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062).
* [CachedAttention](attention.py) implements an attention layer with a cache,
  used for auto-regressive decoding.
* [KernelAttention](kernel_attention.py) implements a family of attention
  mechanisms that express self-attention as a linear dot-product of
  kernel feature maps and exploit the associativity of matrix products to
  reduce the complexity from quadratic to linear. The implementation includes
  methods described in ["Transformers are RNNs:
  Fast Autoregressive Transformers with Linear Attention"](https://arxiv.org/abs/2006.16236),
  ["Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794),
  and ["Random Feature Attention"](https://openreview.net/pdf?id=QtTKTdVrFBB).
* [MatMulWithMargin](mat_mul_with_margin.py) implements a matrix
  multiplication with margin, used for training retrieval/ranking
  tasks, as described in ["Improving Multilingual Sentence Embedding using
  Bi-directional Dual Encoder with Additive Margin
  Softmax"](https://www.ijcai.org/Proceedings/2019/0746.pdf).
* [MultiChannelAttention](multi_channel_attention.py) implements a variant of
  multi-head attention that can be used to merge multiple streams for
  cross-attention.
* [TalkingHeadsAttention](talking_heads_attention.py) implements the talking
  heads attention, as described in
  ["Talking-Heads Attention"](https://arxiv.org/abs/2003.02436).
* [Transformer](transformer.py) implements an optionally masked transformer as
described in
["Attention Is All You Need"](https://arxiv.org/abs/1706.03762).
* [TransformerDecoderBlock](transformer.py) implements a single transformer
  decoder block, made up of multi-head self-attention, multi-head
  cross-attention, and a feedforward network.
* [RandomFeatureGaussianProcess](gaussian_process.py) implements a random
  feature-based Gaussian process, as described in ["Random Features for
  Large-Scale Kernel Machines"](https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).
* [ReuseMultiHeadAttention](reuse_attention.py) supports passing in attention
  scores to be reused, avoiding recomputation, as described in
  ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReuseTransformer](reuse_transformer.py) supports reusing attention scores
  from lower layers in higher layers to avoid recomputing them, as described
  in ["Leveraging redundancy in attention with Reuse Transformers"](https://arxiv.org/abs/2110.06821).
* [ReZeroTransformer](rezero_transformer.py) implements a Transformer with
  ReZero, as described in
  ["ReZero is All You Need: Fast Convergence at Large Depth"](https://arxiv.org/abs/2003.04887).
* [OnDeviceEmbedding](on_device_embedding.py) implements efficient embedding
lookups designed for TPU-based models.
* [PositionalEmbedding](position_embedding.py) creates a positional embedding
  as described in ["BERT: Pre-training of Deep Bidirectional Transformers for
  Language Understanding"](https://arxiv.org/abs/1810.04805). A combined usage
  sketch with `OnDeviceEmbedding` follows this list.
* [SelfAttentionMask](self_attention_mask.py) creates a 3D attention mask from
a 2D tensor mask.
* [SpectralNormalization](spectral_normalization.py) implements a tf.Wrapper
that applies spectral normalization regularization to a given layer. See
[Spectral Norm Regularization for Improving the Generalizability of
  Deep Learning](https://arxiv.org/abs/1705.10941).
* [MaskedSoftmax](masked_softmax.py) implements a softmax with an optional
  masking input. If no mask is provided to this layer, it performs a standard
  softmax; however, if a mask tensor is applied (which should be 1 in
  positions where the data should be allowed through, and 0 where the data
  should be masked), the output will have masked positions set to
  approximately zero. A small numeric sketch follows this list.
* [MaskedLM](masked_lm.py) implements a masked language model. It assumes
  the embedding table variable is passed to it.
* [ClassificationHead](cls_head.py) implements a pooling head over a sequence
  of embeddings, commonly used for classification tasks.
* [GaussianProcessClassificationHead](cls_head.py) implements a
  spectral-normalized neural Gaussian process (SNGP) classification head, as
  described in ["Simple and Principled Uncertainty Estimation with
  Deterministic Deep Learning via Distance Awareness"](https://arxiv.org/abs/2006.10108).
* [GatedFeedforward](gated_feedforward.py) implements the gated linear unit
  (GLU) feedforward layer, as described in
  ["GLU Variants Improve Transformer"](https://arxiv.org/abs/2002.05202).
* [MultiHeadRelativeAttention](relative_attention.py) implements a variant
of multi-head attention with support for relative position encodings as
described in ["Transformer-XL: Attentive Language Models Beyond a
Fixed-Length Context"](https://arxiv.org/abs/1901.02860). This also has
extended support for segment-based attention, a re-parameterization
introduced in ["XLNet: Generalized Autoregressive Pretraining for Language
Understanding"](https://arxiv.org/abs/1906.08237).
* [TwoStreamRelativeAttention](relative_attention.py) implements a variant
  of multi-head relative attention as described in ["XLNet: Generalized
  Autoregressive Pretraining for Language
  Understanding"](https://arxiv.org/abs/1906.08237). This takes in a query and
  content stream and applies self-attention.
* [TransformerXL](transformer_xl.py) implements Transformer-XL, introduced in
  ["Transformer-XL: Attentive Language Models Beyond a Fixed-Length
  Context"](https://arxiv.org/abs/1901.02860). This contains
  `TransformerXLBlock`, a block containing either one- or two-stream relative
  self-attention as well as the subsequent feedforward network, and
  `TransformerXL`, which holds the attention biases and stacks multiple
  `TransformerXLBlock` layers.
* [MobileBertEmbedding](mobile_bert_layers.py) and
  [MobileBertTransformer](mobile_bert_layers.py) implement the embedding and
  transformer layers proposed in the
  [MobileBERT paper](https://arxiv.org/pdf/2004.02984.pdf).
* [BertPackInputs](text_layers.py), [BertTokenizer](text_layers.py), and
  [SentencepieceTokenizer](text_layers.py) implement layers that tokenize raw
  text and pack it into the inputs for BERT models.
* [TransformerEncoderBlock](transformer_encoder_block.py) implements
  an optionally masked transformer encoder block as described in
  ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762). A usage
  sketch combining this block with `SelfAttentionMask` follows this list.