<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ALBERT
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/models?filter=albert">
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-albert-blueviolet">
</a>
<a href="https://huggingface.co/spaces/docs-demos/albert-base-v2">
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue">
</a>
</div>
## Overview
The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
speed of BERT:
- Splitting the embedding matrix into two smaller matrices.
- Using repeating layers split among groups.
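The effect of splitting the embedding matrix can be checked with simple arithmetic. The following is a minimal sketch, not library code: `embedding_params` is a hypothetical helper, the sizes are illustrative values in the range of a base-sized model, and the bias of the projection layer is ignored.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count embedding weights with or without factorization (hypothetical helper)."""
    if embedding_size is None:
        # Unfactorized, BERT-style: a single V x H lookup table.
        return vocab_size * hidden_size
    # Factorized: a V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumptions of this sketch): V = 30000, H = 768, E = 128.
unfactorized = embedding_params(30000, 768)
factorized = embedding_params(30000, 768, 128)
print(unfactorized)  # 23040000
print(factorized)    # 3938304
```

With these sizes the factorized embedding needs roughly one sixth of the weights, and the saving grows as H increases while E stays fixed.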
The abstract from the paper is the following:
*Increasing model size when pretraining natural language representations often results in improved performance on
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
Tips:
- ALBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
than the left.
- ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains
similar to that of a BERT-like architecture with the same number of hidden layers, since it has to iterate through the same
number of (repeating) layers.
- The embedding size E is different from the hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it's more logical to have H >> E. Also, the embedding matrix is large since it's V x E (V being the vocab size). If E < H, it has fewer parameters.
- Layers are split in groups that share parameters (to save memory).
- Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict whether they have been swapped.
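The sentence ordering objective described above can be illustrated by how training pairs might be built. This is a minimal sketch under stated assumptions, not the original pretraining pipeline: `make_sop_examples` is a hypothetical helper, and the label convention (0 for original order, 1 for swapped) and the 50% swap rate are choices made for this example.

```python
import random
from typing import List, Tuple


def make_sop_examples(sentences: List[str], swap_prob: float = 0.5,
                      seed: int = 0) -> List[Tuple[str, str, int]]:
    """Build (first, second, label) triples from consecutive sentences.

    Label 0 means the pair is in its original order (A then B);
    label 1 means it was swapped (B then A). Both the labels and the
    swap probability are assumptions of this sketch.
    """
    rng = random.Random(seed)
    examples = []
    # Pair each sentence with the one that follows it in the document.
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < swap_prob:
            examples.append((b, a, 1))  # swapped order
        else:
            examples.append((a, b, 0))  # original order
    return examples


doc = ["She opened the door.", "The room was dark.", "She reached for the switch."]
for first, second, label in make_sop_examples(doc):
    print(label, "|", first, "|", second)
```

Because both segments always come from the same document, the model cannot solve the task through topic cues alone and has to rely on the order of the two sentences.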
This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
## Documentation resources
- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)
## AlbertConfig
[[autodoc]] AlbertConfig
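The tip that layers are split in groups that share parameters can be made concrete with a small sketch of how layer indices might map to groups. This is an illustration only, not the library's implementation: `layer_to_group` is a hypothetical helper, its parameter names are assumed to mirror the corresponding configuration fields (check them against the reference above), and the contiguous-block mapping is an assumption of the sketch.

```python
from typing import List


def layer_to_group(num_hidden_layers: int, num_hidden_groups: int) -> List[int]:
    """Return, for each layer index, the index of the group whose weights it reuses.

    Assumes layers are assigned to groups in contiguous, equally sized blocks.
    """
    if num_hidden_groups < 1 or num_hidden_layers % num_hidden_groups != 0:
        raise ValueError("num_hidden_layers must be a positive multiple of num_hidden_groups")
    layers_per_group = num_hidden_layers // num_hidden_groups
    return [layer // layers_per_group for layer in range(num_hidden_layers)]


# One group: all 12 layers reuse the same set of weights.
print(layer_to_group(12, 1))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Three groups: each block of 4 layers shares one set of weights.
print(layer_to_group(12, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

In both cases the forward pass still runs 12 layers, which is why sharing reduces memory but not compute.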
## AlbertTokenizer
[[autodoc]] AlbertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary
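To give an idea of what the special-token and token-type methods listed above produce, and of the right-padding advice from the tips, here is a plain-Python sketch. It is not the tokenizer's code: `encode_pair` is a hypothetical helper, the special-token ids are placeholder values, and the BERT-style layout `[CLS] A [SEP] B [SEP]` is assumed.

```python
from typing import Dict, List, Optional

# Placeholder special-token ids, chosen for illustration only (assumption).
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0


def encode_pair(ids_a: List[int], ids_b: Optional[List[int]] = None,
                max_length: Optional[int] = None) -> Dict[str, List[int]]:
    """Add special tokens, build token type ids, and pad on the right."""
    # First segment: [CLS] A [SEP], all marked as segment 0.
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Second segment: B [SEP], all marked as segment 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    if max_length is not None and len(input_ids) < max_length:
        # Right padding keeps real tokens at the positions seen during pretraining.
        pad = max_length - len(input_ids)
        input_ids += [PAD_ID] * pad
        token_type_ids += [0] * pad
        attention_mask += [0] * pad
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair([10, 11], [20], max_length=8)
print(encoded["input_ids"])       # [2, 10, 11, 3, 20, 3, 0, 0]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

In practice the tokenizer handles all of this when it is called on text; the sketch only shows the shape of the result.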
## AlbertTokenizerFast
[[autodoc]] AlbertTokenizerFast
## Albert specific outputs
[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
## AlbertModel
[[autodoc]] AlbertModel
- forward
## AlbertForPreTraining
[[autodoc]] AlbertForPreTraining
- forward
## AlbertForMaskedLM
[[autodoc]] AlbertForMaskedLM
- forward
## AlbertForSequenceClassification
[[autodoc]] AlbertForSequenceClassification
- forward
## AlbertForMultipleChoice
[[autodoc]] AlbertForMultipleChoice
## AlbertForTokenClassification
[[autodoc]] AlbertForTokenClassification
- forward
## AlbertForQuestionAnswering
[[autodoc]] AlbertForQuestionAnswering
- forward
## TFAlbertModel
[[autodoc]] TFAlbertModel
- call
## TFAlbertForPreTraining
[[autodoc]] TFAlbertForPreTraining
- call
## TFAlbertForMaskedLM
[[autodoc]] TFAlbertForMaskedLM
- call
## TFAlbertForSequenceClassification
[[autodoc]] TFAlbertForSequenceClassification
- call
## TFAlbertForMultipleChoice
[[autodoc]] TFAlbertForMultipleChoice
- call
## TFAlbertForTokenClassification
[[autodoc]] TFAlbertForTokenClassification
- call
## TFAlbertForQuestionAnswering
[[autodoc]] TFAlbertForQuestionAnswering
- call
## FlaxAlbertModel
[[autodoc]] FlaxAlbertModel
- __call__
## FlaxAlbertForPreTraining
[[autodoc]] FlaxAlbertForPreTraining
- __call__
## FlaxAlbertForMaskedLM
[[autodoc]] FlaxAlbertForMaskedLM
- __call__
## FlaxAlbertForSequenceClassification
[[autodoc]] FlaxAlbertForSequenceClassification
- __call__
## FlaxAlbertForMultipleChoice
[[autodoc]] FlaxAlbertForMultipleChoice
- __call__
## FlaxAlbertForTokenClassification
[[autodoc]] FlaxAlbertForTokenClassification
- __call__
## FlaxAlbertForQuestionAnswering
[[autodoc]] FlaxAlbertForQuestionAnswering
- __call__