---
license: apache-2.0
tags:
  - neural-compressor
  - 8-bit
  - int8
  - Intel® Neural Compressor
  - PostTrainingStatic
  - onnx
datasets:
  - squad
metrics:
  - f1
---

Model Card for INT8 DistilBERT Base Uncased Fine-Tuned on SQuAD

This model is an INT8 quantized version of DistilBERT base uncased that has been fine-tuned on the Stanford Question Answering Dataset (SQuAD). The quantization was performed with Hugging Face's Optimum Intel, which leverages Intel® Neural Compressor.

| Model Detail | Description |
| --- | --- |
| Model Authors | Xin He, Zixuan Cheng, Yu Wenz |
| Date | Aug 4, 2022 |
| Version | The base model for this quantization process was distilbert-base-uncased-distilled-squad, a distilled version of BERT designed for the question-answering task. |
| Type | Language Model |
| Paper or Other Resources | Base Model: distilbert-base-uncased-distilled-squad |
| License | apache-2.0 |
| Questions or Comments | Community Tab and Intel DevHub Discord |
| Quantization Details | The model underwent post-training static quantization to convert it from its original FP32 precision to INT8, optimizing for size and inference speed while aiming to retain as much of the original model's accuracy as possible. |
| Calibration Details | For PyTorch, the calibration dataloader was the train dataloader, with an actual sampling size of 304 because the default calibration sampling size of 300 is not exactly divisible by the batch size of 8. For the ONNX version, calibration was performed using the eval dataloader with the default calibration sampling size of 100. |
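
To make the quantization and calibration details above more concrete, here is a minimal sketch in plain Python. It makes two assumptions that are not confirmed by this card: that the 304 figure comes from rounding the requested 300 samples up to whole batches of 8, and that calibration is used to derive an affine (scale and zero-point) INT8 mapping from observed value ranges. All function names are hypothetical, and this is a generic illustration of static quantization, not Intel® Neural Compressor's implementation.

```python
import math


def effective_calibration_samples(requested: int, batch_size: int) -> int:
    """Round the requested sample count up to a whole number of batches (assumed behavior)."""
    return math.ceil(requested / batch_size) * batch_size


def calibrate_scale_zero_point(samples, qmin=-128, qmax=127):
    """Derive an affine INT8 mapping from the value range seen during calibration."""
    # The range is widened to include 0.0 so that zero maps exactly onto an integer.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an INT8 code, clamping to the representable range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to an approximate float value."""
    return (q - zero_point) * scale


# 300 requested samples with batch size 8 become 38 full batches.
print(effective_calibration_samples(300, 8))

# Toy "activations" standing in for values observed on calibration data.
observed = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = calibrate_scale_zero_point(observed)
for x in observed:
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Because the scale and zero-point are fixed ahead of time from calibration data, inference does not need to measure ranges on the fly; the trade-off is that inputs very different from the calibration set may be clamped or represented coarsely.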
| Intended Use | Description |
| --- | --- |
| Primary intended uses | This model is intended for question-answering tasks, where it can provide answers to questions given a context passage. It is optimized for scenarios requiring fast inference and reduced model size without significantly compromising accuracy. |
| Primary intended users | Researchers, developers, and enterprises that require efficient, low-latency question answering capabilities in their applications, particularly where computational resources are limited. |
| Out-of-scope uses | |

Evaluation

PyTorch Version

This is an INT8 PyTorch model quantized with huggingface/optimum-intel using Intel® Neural Compressor.

| | INT8 | FP32 |
| --- | --- | --- |
| Accuracy (eval-f1) | 86.1069 | 86.8374 |
| Model size (MB) | 74.7 | 265 |

ONNX Version

This is an INT8 ONNX model quantized with Intel® Neural Compressor.

| | INT8 | FP32 |
| --- | --- | --- |
| Accuracy (eval-f1) | 0.8633 | 0.8687 |
| Model size (MB) | 154 | 254 |
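
The two evaluation tables report F1 on different scales (percent for PyTorch, fraction for ONNX), so a relative comparison is easier to read. The short sketch below recomputes the relative F1 drop and the size reduction from the figures reported above; the helper names are hypothetical and nothing here is re-measured.

```python
def relative_drop_pct(fp32: float, int8: float) -> float:
    """Relative loss of the INT8 model versus FP32, in percent (scale-independent)."""
    return (fp32 - int8) / fp32 * 100.0


def compression_ratio(fp32_mb: float, int8_mb: float) -> float:
    """How many times smaller the INT8 model is than the FP32 model."""
    return fp32_mb / int8_mb


# Figures copied from the evaluation tables: (FP32, INT8).
reported = {
    "PyTorch": {"f1": (86.8374, 86.1069), "size_mb": (265, 74.7)},
    "ONNX": {"f1": (0.8687, 0.8633), "size_mb": (254, 154)},
}

for name, row in reported.items():
    drop = relative_drop_pct(*row["f1"])
    ratio = compression_ratio(*row["size_mb"])
    print(f"{name}: relative F1 drop {drop:.2f}%, {ratio:.2f}x smaller")
```

On the reported numbers this works out to a relative F1 drop of roughly 0.84% with a model about 3.5 times smaller for PyTorch, and roughly 0.62% with a model about 1.6 times smaller for ONNX.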

Usage

Optimum Intel w/ Neural Compressor

```python
from optimum.intel import INCModelForQuestionAnswering

# Load the INT8 PyTorch model from the Hugging Face Hub
model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
int8_model = INCModelForQuestionAnswering.from_pretrained(model_id)
```

Optimum w/ ONNX Runtime

```python
from optimum.onnxruntime import ORTModelForQuestionAnswering

# Load the INT8 ONNX model from the Hugging Face Hub
model_id = "Intel/distilbert-base-uncased-distilled-squad-int8-static"
model = ORTModelForQuestionAnswering.from_pretrained(model_id)
```
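
Once loaded, the ONNX model can be wrapped in a standard `transformers` question-answering pipeline. The following is a sketch rather than an official recipe: the `answer_question` helper is hypothetical, and it assumes the tokenizer files are available in the same repository as the model (if they are not, the tokenizer of the base model, distilbert-base-uncased-distilled-squad, would be the natural substitute).

```python
DEFAULT_MODEL_ID = "Intel/distilbert-base-uncased-distilled-squad-int8-static"


def answer_question(question: str, context: str, model_id: str = DEFAULT_MODEL_ID) -> dict:
    """Answer a question from a context passage with the INT8 ONNX model."""
    # Imported lazily so optimum/transformers are only required when the function is called.
    from optimum.onnxruntime import ORTModelForQuestionAnswering
    from transformers import AutoTokenizer, pipeline

    model = ORTModelForQuestionAnswering.from_pretrained(model_id)
    # Assumption: the tokenizer is published alongside the model weights.
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
    # The pipeline returns a dict with "answer", "score", "start" and "end" keys.
    return qa(question=question, context=context)
```

A call such as `answer_question("What was the model fine-tuned on?", "The model was fine-tuned on SQuAD.")` downloads the model on first use and returns the extracted answer span with its confidence score.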

Ethical Considerations

Users should be aware of potential biases present in the training data (SQuAD and Wikipedia) and consider how these biases may affect the model's outputs. Additionally, quantization may introduce or exacerbate biases in certain scenarios.

Caveats and Recommendations

  • Users should consider the balance between performance and accuracy when deploying quantized models in critical applications.
  • Further fine-tuning or calibration may be necessary for specific use cases or to meet stricter accuracy requirements.