# DistilBERT |
|
|
|
<div class="flex flex-wrap space-x-1"> |
|
<a href="https://huggingface.co/models?filter=distilbert"> |
|
<img alt="Models" src="https://img.shields.io/badge/All_model_pages-distilbert-blueviolet"> |
|
</a> |
|
<a href="https://huggingface.co/spaces/docs-demos/distilbert-base-uncased"> |
|
<img alt="Spaces" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"> |
|
</a> |
|
</div> |
|
|
|
## Overview |
|
|
|
The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a |
|
distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a |
|
distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a |
|
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than *bert-base-uncased* and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.
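
For a quick first look, the distilled checkpoint behaves like any other masked language model; for example, a minimal sketch with the `pipeline` API and the `distilbert-base-uncased` checkpoint:

```python
from transformers import pipeline

# distilbert-base-uncased is the distilled counterpart of bert-base-uncased
unmasker = pipeline("fill-mask", model="distilbert-base-uncased")
unmasker("Distillation makes large models [MASK] and faster.")
```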
|
|
|
The abstract from the paper is the following: |
|
|
|
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), |
|
operating these large models in on-the-edge and/or under constrained computational training or inference budgets |
|
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation |
|
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger |
|
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage |
|
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by |
|
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive |
|
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, |
|
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we |
|
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device |
|
study.* |
|
|
|
Tips: |
|
|
|
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`); see the tokenizer snippet below.
|
- DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if necessary though; just let us know if you need this option.
|
- Same as BERT but smaller. Trained by distillation of the pretrained BERT model, meaning it's been trained to predict the same probabilities as the larger model. The actual objective is a combination of:
|
|
|
* finding the same probabilities as the teacher model |
|
* predicting the masked tokens correctly (but no next-sentence objective) |
|
* a cosine similarity between the hidden states of the student and the teacher model |
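
A minimal PyTorch sketch of that triple objective (the `triple_loss` name is hypothetical, and the equal term weights and temperature value are illustrative assumptions; the original training script weights and masks the terms differently):

```python
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden, temperature=2.0):
    # 1) distillation: match the teacher's softened output distribution
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # 2) masked language modeling: predict the masked tokens correctly
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # non-masked positions carry the -100 label
    )

    # 3) cosine loss: align student and teacher hidden states
    s = student_hidden.view(-1, student_hidden.size(-1))
    t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    loss_cos = F.cosine_embedding_loss(s, t, s.new_ones(s.size(0)))

    # equal weighting here is an assumption, not the trained configuration
    return loss_ce + loss_mlm + loss_cos
```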
|
|
|
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). The Flax version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
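
As noted in the tips above, DistilBERT handles segment pairs with the separation token alone; a quick sketch confirming that the tokenizer returns no `token_type_ids`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Encoding a sentence pair: the segments are joined with [SEP],
# and no token_type_ids are produced.
encoding = tokenizer("How are you?", "I'm fine, thanks.")
print(encoding.keys())  # dict_keys(['input_ids', 'attention_mask'])
print(tokenizer.decode(encoding["input_ids"]))
# [CLS] how are you? [SEP] i'm fine, thanks. [SEP]
```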
|
|
|
## Resources |
|
|
|
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
|
|
<PipelineTag pipeline="text-classification"/> |
|
|
|
- A blog post on [Getting Started with Sentiment Analysis using Python](https://huggingface.co/blog/sentiment-analysis-python) with DistilBERT. |
|
- A blog post on how to [train DistilBERT with Blurr for sequence classification](https://huggingface.co/blog/fastai). |
|
- A blog post on how to use [Ray to tune DistilBERT hyperparameters](https://huggingface.co/blog/ray-tune). |
|
- A blog post on how to [train DistilBERT with Hugging Face and Amazon SageMaker](https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face). |
|
- A notebook on how to [finetune DistilBERT for multi-label classification](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb). 🌎

- A notebook on how to [finetune DistilBERT for multiclass classification with PyTorch](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb). 🌎

- A notebook on how to [finetune DistilBERT for text classification in TensorFlow](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb). 🌎
|
- [`DistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). |
|
- [`TFDistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). |
|
- [`FlaxDistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). |
|
- [Text classification task guide](../tasks/sequence_classification) |
|
|
|
|
|
<PipelineTag pipeline="token-classification"/> |
|
|
|
- [`DistilBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb). |
|
- [`TFDistilBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). |
|
- [`FlaxDistilBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). |
|
- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course.
|
- [Token classification task guide](../tasks/token_classification) |
|
|
|
|
|
<PipelineTag pipeline="fill-mask"/> |
|
|
|
- [`DistilBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). |
|
- [`TFDistilBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). |
|
- [`FlaxDistilBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). |
|
- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course.
|
- [Masked language modeling task guide](../tasks/masked_language_modeling) |
|
|
|
<PipelineTag pipeline="question-answering"/> |
|
|
|
- [`DistilBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb). |
|
- [`TFDistilBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). |
|
- [`FlaxDistilBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). |
|
- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course.
|
- [Question answering task guide](../tasks/question_answering) |
|
|
|
**Multiple choice** |
|
- [`DistilBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). |
|
- [`TFDistilBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). |
|
- [Multiple choice task guide](../tasks/multiple_choice) |
|
|
|
⚗️ Optimization
|
|
|
- A blog post on how to [quantize DistilBERT with 🤗 Optimum and Intel](https://huggingface.co/blog/intel).

- A blog post on [Optimizing Transformers for GPUs with 🤗 Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum-gpu).

- A blog post on [Optimizing Transformers with Hugging Face Optimum](https://www.philschmid.de/optimizing-transformers-with-optimum).
|
|
|
⚡️ Inference
|
|
|
- A blog post on how to [Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia](https://huggingface.co/blog/bert-inferentia-sagemaker) with DistilBERT. |
|
- A blog post on [Serverless Inference with Hugging Face's Transformers, DistilBERT and Amazon SageMaker](https://www.philschmid.de/sagemaker-serverless-huggingface-distilbert). |
|
|
|
🚀 Deploy
|
|
|
- A blog post on how to [deploy DistilBERT on Google Cloud](https://huggingface.co/blog/how-to-deploy-a-pipeline-to-google-clouds). |
|
- A blog post on how to [deploy DistilBERT with Amazon SageMaker](https://huggingface.co/blog/deploy-hugging-face-models-easily-with-amazon-sagemaker). |
|
- A blog post on how to [Deploy BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module](https://www.philschmid.de/terraform-huggingface-amazon-sagemaker). |
|
|
|
## DistilBertConfig |
|
|
|
[[autodoc]] DistilBertConfig |
|
|
|
## DistilBertTokenizer |
|
|
|
[[autodoc]] DistilBertTokenizer |
|
|
|
## DistilBertTokenizerFast |
|
|
|
[[autodoc]] DistilBertTokenizerFast |
|
|
|
## DistilBertModel |
|
|
|
[[autodoc]] DistilBertModel |
|
- forward |
|
|
|
## DistilBertForMaskedLM |
|
|
|
[[autodoc]] DistilBertForMaskedLM |
|
- forward |
|
|
|
## DistilBertForSequenceClassification |
|
|
|
[[autodoc]] DistilBertForSequenceClassification |
|
- forward |
|
|
|
## DistilBertForMultipleChoice |
|
|
|
[[autodoc]] DistilBertForMultipleChoice |
|
- forward |
|
|
|
## DistilBertForTokenClassification |
|
|
|
[[autodoc]] DistilBertForTokenClassification |
|
- forward |
|
|
|
## DistilBertForQuestionAnswering |
|
|
|
[[autodoc]] DistilBertForQuestionAnswering |
|
- forward |
|
|
|
## TFDistilBertModel |
|
|
|
[[autodoc]] TFDistilBertModel |
|
- call |
|
|
|
## TFDistilBertForMaskedLM |
|
|
|
[[autodoc]] TFDistilBertForMaskedLM |
|
- call |
|
|
|
## TFDistilBertForSequenceClassification |
|
|
|
[[autodoc]] TFDistilBertForSequenceClassification |
|
- call |
|
|
|
## TFDistilBertForMultipleChoice |
|
|
|
[[autodoc]] TFDistilBertForMultipleChoice |
|
- call |
|
|
|
## TFDistilBertForTokenClassification |
|
|
|
[[autodoc]] TFDistilBertForTokenClassification |
|
- call |
|
|
|
## TFDistilBertForQuestionAnswering |
|
|
|
[[autodoc]] TFDistilBertForQuestionAnswering |
|
- call |
|
|
|
## FlaxDistilBertModel |
|
|
|
[[autodoc]] FlaxDistilBertModel |
|
- __call__ |
|
|
|
## FlaxDistilBertForMaskedLM |
|
|
|
[[autodoc]] FlaxDistilBertForMaskedLM |
|
- __call__ |
|
|
|
## FlaxDistilBertForSequenceClassification |
|
|
|
[[autodoc]] FlaxDistilBertForSequenceClassification |
|
- __call__ |
|
|
|
## FlaxDistilBertForMultipleChoice |
|
|
|
[[autodoc]] FlaxDistilBertForMultipleChoice |
|
- __call__ |
|
|
|
## FlaxDistilBertForTokenClassification |
|
|
|
[[autodoc]] FlaxDistilBertForTokenClassification |
|
- __call__ |
|
|
|
## FlaxDistilBertForQuestionAnswering |
|
|
|
[[autodoc]] FlaxDistilBertForQuestionAnswering |
|
- __call__ |
|
|