File size: 5,943 Bytes

---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---
## INT4 Whisper tiny ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. This is the repository of INT4 weight only quantization for the Whisper tiny model in ONNX format, powered by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).

This INT4 ONNX model is generated by [Intel® Neural Compressor](https://github.com/intel/neural-compressor)'s weight-only quantization method.


| Model Detail | Description |
| ----------- | ----------- | 
| Model Authors - Company | Intel | 
| Date | October 8, 2023 | 
| Version | 1 | 
| Type | Speech Recognition | 
| Paper or Other Resources | - | 
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-tiny-onnx-int4/discussions)|

| Intended Use | Description |
| ----------- | ----------- | 
| Primary intended uses | You can use the raw model for automatic speech recognition inference | 
| Primary intended users | Anyone doing automatic speech recognition inference | 
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task.  The model should not be used to intentionally create hostile or alienating environments for people.|

### Export to ONNX Model

The FP32 model is exported with openai/whisper-tiny:

```shell
optimum-cli export onnx --model openai/whisper-tiny whisper-tiny-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

### Install ONNX Runtime

Install `onnxruntime>=1.16.0` to support [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.

### Run Quantization

Build [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master) from master branch and run INT4 weight-only quantization.

The weight-only quantization cofiguration is as below:
| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4  | 32        | asym   | RTN       |

We provide the key code below. For the complete script, please refer to [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

```python
from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4, 
                                        "algorithm": ["RTN"], 
                                        "scheme": ["asym"], 
                                        "group_size": 32}}},
        op_name_dict={'/proj_out/MatMul': FP32},) # fallback last matmul in decoder to FP32
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-tiny-with-past", model), # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-tiny-onnx-int4", model)) # INT4 model path
```

### Evaluation

**Operator Statistics**

Below shows the operator statistics in the INT4 ONNX model:
|Model| Op Type | Total |  INT4 weight |  FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|encoder_model|  MatMul |  32  |    24    |   8   |
|decoder_model|  MatMul |  57  |    40    |   17   |
|decoder_with_past_model|  MatMul |  49  |    32    |   17   |

**Evaluation of wer**

Evaluate the model on `librispeech_asr` dataset with below code:

```python
import os
from evaluate import load
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor, AutoConfig
model_name = 'openai/whisper-tiny'
model_path = 'whisper-tiny-onnx-int4'
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig
model_config = PretrainedConfig.from_pretrained(model_name)
predictions = []
references = []
sessions = ORTModelForSpeechSeq2Seq.load_model(
            os.path.join(model_path, 'encoder_model.onnx'),
            os.path.join(model_path, 'decoder_model.onnx'),
            os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])
for idx, batch in enumerate(librispeech_test_clean):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)
wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```

## Metrics (Model Performance):
| Model  | Model Size (MB) | wer |
|---|:---:|:---:|
| FP32 |406|7.56|
| INT4 |326|9.94|