<!--Copyright 2021 NVIDIA Corporation and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# QDQBERT

## Overview
The QDQBERT model was proposed in [Integer Quantization for Deep Learning Inference: Principles and Empirical
Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius
Micikevicius.

The abstract from the paper is the following:
*Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by
taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of
quantization parameters and evaluate their choices on a wide range of neural network models for different application
domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration
by processors with high-throughput integer math pipelines. We also present a workflow for 8-bit quantization that is
able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are
more difficult to quantize, such as MobileNets and BERT-large.*
Tips:

- The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to (i) linear layer
  inputs and weights, (ii) matmul inputs, and (iii) residual add inputs in the BERT model.
- QDQBERT requires the [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization). To install it: `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com`
- The QDQBERT model can be loaded from any checkpoint of a HuggingFace BERT model (for example *bert-base-uncased*) and
  used to perform Quantization Aware Training or Post Training Quantization.
- A complete example of using the QDQBERT model to perform Quantization Aware Training and Post Training Quantization for
  the SQuAD task can be found at [transformers/examples/research_projects/quantization-qdqbert/](examples/research_projects/quantization-qdqbert/).

This model was contributed by [shangz](https://huggingface.co/shangz).
### Set default quantizers

The QDQBERT model adds fake quantization operations (pairs of QuantizeLinear/DequantizeLinear ops) to BERT via
`TensorQuantizer` in the [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization). `TensorQuantizer` is the module
for quantizing tensors, with `QuantDescriptor` defining how the tensor should be quantized. Refer to the [Pytorch
Quantization Toolkit user guide](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/userguide.html) for more details.

Before creating the QDQBERT model, one has to set the default `QuantDescriptor` defining the default tensor quantizers.

Example:
```python
>>> import pytorch_quantization.nn as quant_nn
>>> from pytorch_quantization.tensor_quant import QuantDescriptor

>>> # The default tensor quantizer is set to use Max calibration method
>>> input_desc = QuantDescriptor(num_bits=8, calib_method="max")
>>> # The default tensor quantizer is set to be per-channel quantization for weights
>>> weight_desc = QuantDescriptor(num_bits=8, axis=(0,))
>>> quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)
>>> quant_nn.QuantLinear.set_default_quant_desc_weight(weight_desc)
```
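With the default quantizers in place, the QDQBERT model itself can be created from a regular BERT checkpoint. A minimal sketch, assuming the Pytorch Quantization Toolkit is installed and the defaults above have been set (the checkpoint name is only an example):

```python
>>> from transformers import QDQBertModel

>>> # Fake quantization ops are inserted automatically, using the default quantizers set above
>>> model = QDQBertModel.from_pretrained("bert-base-uncased")
```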
### Calibration

Calibration is the process of passing data samples to the quantizer and deciding the best scaling factors for
tensors. After setting up the tensor quantizers, one can use the following example to calibrate the model:
```python
>>> # Find the TensorQuantizer and enable calibration
>>> for name, module in model.named_modules():
...     if name.endswith("_input_quantizer"):
...         module.enable_calib()
...         module.disable_quant()  # Use full precision data to calibrate

>>> # Feeding data samples
>>> model(x)
>>> # ...

>>> # Finalize calibration
>>> for name, module in model.named_modules():
...     if name.endswith("_input_quantizer"):
...         module.load_calib_amax()
...         module.enable_quant()

>>> # If running on GPU, call .cuda() again because new tensors will be created by the calibration process
>>> model.cuda()

>>> # Keep running the quantized model
>>> # ...
```
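The `model(x)` placeholder above stands for forward passes over representative data. A hypothetical feeding loop, assuming a tokenizer and a small list of sample texts (both illustrative, not part of the toolkit API):

```python
>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> # `calib_texts` is a hypothetical, representative sample of the task data
>>> calib_texts = ["Hello, world!", "Quantization keeps accuracy close to the FP32 baseline."]
>>> with torch.no_grad():
...     for text in calib_texts:
...         inputs = tokenizer(text, return_tensors="pt").to(model.device)
...         model(**inputs)
```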
### Export to ONNX

The goal of exporting to ONNX is to deploy inference with [TensorRT](https://developer.nvidia.com/tensorrt). Fake
quantization will be broken into a pair of QuantizeLinear/DequantizeLinear ONNX ops. After setting the static member of
`TensorQuantizer` to use PyTorch's own fake quantization functions, the fake-quantized model can be exported to ONNX by
following the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Example:
```python
>>> import torch
>>> from pytorch_quantization.nn import TensorQuantizer

>>> TensorQuantizer.use_fb_fake_quant = True

>>> # Load the calibrated model
>>> ...
>>> # ONNX export
>>> torch.onnx.export(...)
```
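For concreteness, here is a sketch of what such an export call might look like. The dummy shapes, file name, and input/output names are assumptions for illustration only; note that per-channel QuantizeLinear/DequantizeLinear ops require ONNX opset 13 or higher:

```python
>>> # Illustrative settings only; adjust shapes and names to the actual model
>>> dummy_input_ids = torch.ones(1, 128, dtype=torch.long)
>>> dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)
>>> torch.onnx.export(
...     model,
...     (dummy_input_ids, dummy_attention_mask),
...     "qdqbert.onnx",
...     input_names=["input_ids", "attention_mask"],
...     output_names=["output"],
...     opset_version=13,
... )
```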
## Documentation resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Question answering task guide](../tasks/question_answering)
- [Causal language modeling task guide](../tasks/language_modeling)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
- [Multiple choice task guide](../tasks/multiple_choice)
## QDQBertConfig

[[autodoc]] QDQBertConfig

## QDQBertModel

[[autodoc]] QDQBertModel
    - forward

## QDQBertLMHeadModel

[[autodoc]] QDQBertLMHeadModel
    - forward

## QDQBertForMaskedLM

[[autodoc]] QDQBertForMaskedLM
    - forward

## QDQBertForSequenceClassification

[[autodoc]] QDQBertForSequenceClassification
    - forward

## QDQBertForNextSentencePrediction

[[autodoc]] QDQBertForNextSentencePrediction
    - forward

## QDQBertForMultipleChoice

[[autodoc]] QDQBertForMultipleChoice
    - forward

## QDQBertForTokenClassification

[[autodoc]] QDQBertForTokenClassification
    - forward

## QDQBertForQuestionAnswering

[[autodoc]] QDQBertForQuestionAnswering
    - forward