---
license: apache-2.0
datasets:
- librispeech_asr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- ONNX
- Intel® Neural Compressor
- neural-compressor
library_name: transformers
---

## INT4 Whisper tiny ONNX Model

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalise well to many datasets and domains without fine-tuning. This repository contains the Whisper tiny model in ONNX format with INT4 weight-only quantization, powered by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).

The INT4 ONNX model is generated with the weight-only quantization method of [Intel® Neural Compressor](https://github.com/intel/neural-compressor).

| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Date | October 8, 2023 |
| Version | 1 |
| Type | Speech Recognition |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/whisper-tiny-onnx-int4/discussions) |

| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the raw model for automatic speech recognition inference |
| Primary intended users | Anyone doing automatic speech recognition inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

### Export to ONNX Model

The FP32 ONNX model is exported from `openai/whisper-tiny` with `optimum-cli`:

```shell
optimum-cli export onnx --model openai/whisper-tiny whisper-tiny-with-past/ --task automatic-speech-recognition-with-past --opset 13
```

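
The later steps expect three graphs in the export directory: `encoder_model.onnx`, `decoder_model.onnx`, and `decoder_with_past_model.onnx`. A minimal sketch for checking that the export produced them follows; the helper name `missing_export_files` is hypothetical and not part of Optimum.

```python
from pathlib import Path

# File names used by the quantization and evaluation scripts in this card.
EXPECTED_FILES = (
    "encoder_model.onnx",
    "decoder_model.onnx",
    "decoder_with_past_model.onnx",
)


def missing_export_files(export_dir):
    """Return the expected ONNX files that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_export_files("whisper-tiny-with-past")
    print("export looks complete" if not missing else f"missing: {missing}")
```

An empty list means all three graphs are present and quantization can start.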
### Install ONNX Runtime

Install `onnxruntime>=1.16.0`, which is required for the [`MatMulFpQ4`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/docs/ContribOperators.md#com.microsoft.MatMulFpQ4) operator.

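
A quick way to confirm the installed runtime meets this minimum is sketched below. It only compares version numbers and does not probe for the operator itself; the helper names `parse_release` and `meets_minimum_ort` are illustrative, and the distribution names checked (`onnxruntime`, `onnxruntime-gpu`) are an assumption about how the runtime was installed.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_ORT = (1, 16, 0)  # minimum release stated in this card


def parse_release(version_string):
    """Turn '1.16.0', '1.17.0.dev1' or '1.16.0+cu118' into a 3-tuple of ints."""
    numbers = re.findall(r"\d+", version_string.split("+")[0])[:3]
    numbers += ["0"] * (3 - len(numbers))
    return tuple(int(n) for n in numbers)


def installed_ort_version():
    """Return the installed ONNX Runtime version string, or None if not found."""
    for dist in ("onnxruntime", "onnxruntime-gpu"):
        try:
            return version(dist)
        except PackageNotFoundError:
            continue
    return None


def meets_minimum_ort(installed=None):
    """True if the given (or installed) version is at least MIN_ORT."""
    installed = installed or installed_ort_version()
    return installed is not None and parse_release(installed) >= MIN_ORT


if __name__ == "__main__":
    print("ONNX Runtime OK" if meets_minimum_ort() else "Please install onnxruntime>=1.16.0")
```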
### Run Quantization

Build [Intel® Neural Compressor](https://github.com/intel/neural-compressor/tree/master) from the master branch and run INT4 weight-only quantization.

The weight-only quantization configuration is as follows:

| dtype | group_size | scheme | algorithm |
| :----- | :---------- | :------ | :--------- |
| INT4 | 32 | asym | RTN |

The key code is shown below. For the complete script, please refer to the [whisper example](https://github.com/intel/intel-extension-for-transformers/tree/main/examples/huggingface/onnxruntime/speech-recognition/quantization).

```python
import os

from neural_compressor import quantization, PostTrainingQuantConfig
from neural_compressor.utils.constant import FP32

model_list = ['encoder_model.onnx', 'decoder_model.onnx', 'decoder_with_past_model.onnx']
for model in model_list:
    config = PostTrainingQuantConfig(
        approach="weight_only",
        calibration_sampling_size=[8],
        op_type_dict={".*": {"weight": {"bits": 4,
                                        "algorithm": ["RTN"],
                                        "scheme": ["asym"],
                                        "group_size": 32}}},
        op_name_dict={'/proj_out/MatMul': FP32},)  # fallback last matmul in decoder to FP32
    # `dataloader` is the calibration dataloader defined in the complete script linked above
    q_model = quantization.fit(
        os.path.join("/path/to/whisper-tiny-with-past", model),  # FP32 model path
        config,
        calib_dataloader=dataloader)
    q_model.save(os.path.join("/path/to/whisper-tiny-onnx-int4", model))  # INT4 model path
```

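
To make the configuration table concrete (INT4, `group_size` 32, `asym`, RTN), here is a minimal NumPy sketch of asymmetric round-to-nearest quantization with one scale and zero point per group of 32 weights. It illustrates the general technique only and is not Intel® Neural Compressor's implementation: the function names are hypothetical, groups are taken along the flattened weight for simplicity, and the 4-bit values are kept in `uint8` rather than packed.

```python
import numpy as np


def rtn_quantize_asym(weight, bits=4, group_size=32):
    """Asymmetric round-to-nearest quantization with per-group scale and zero point.

    Assumes weight.size is divisible by group_size.
    """
    qmax = 2 ** bits - 1  # 15 for INT4
    groups = weight.astype(np.float32).reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    span = w_max - w_min
    # One scale per group; constant groups fall back to a scale of 1.
    scale = np.where(span > 0, span / qmax, 1.0).astype(np.float32)
    # Zero point maps the group minimum to (approximately) integer 0.
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point


def rtn_dequantize(q, scale, zero_point, shape):
    """Reconstruct an FP32 approximation of the original weight."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 64)).astype(np.float32)
    q, scale, zp = rtn_quantize_asym(w)
    w_hat = rtn_dequantize(q, scale, zp, w.shape)
    print("max abs error:", float(np.abs(w - w_hat).max()))
```

Because each group of 32 weights gets its own scale, an outlier only degrades the precision of its own group, which is the usual motivation for a small `group_size`.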
### Evaluation

**Operator Statistics**

The table below shows the operator statistics of the INT4 ONNX model:

| Model | Op Type | Total | INT4 weight | FP32 weight |
|:-------:|:-------:|:-------:|:-------:|:-------:|
| encoder_model | MatMul | 32 | 24 | 8 |
| decoder_model | MatMul | 57 | 40 | 17 |
| decoder_with_past_model | MatMul | 49 | 32 | 17 |

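
Statistics of this kind can be approximated by counting node types in each graph. The sketch below assumes that INT4 MatMuls appear as `MatMulFpQ4` nodes (the operator named in the install section) and that FP32 ones remain plain `MatMul` nodes; both helper names are hypothetical, and nodes nested in subgraphs are not visited.

```python
from collections import Counter


def matmul_stats(op_types, int4_op="MatMulFpQ4", fp32_op="MatMul"):
    """Count INT4 and FP32 MatMul nodes from an iterable of op-type strings."""
    counts = Counter(op_types)
    int4, fp32 = counts[int4_op], counts[fp32_op]
    return {"total": int4 + fp32, "int4": int4, "fp32": fp32}


def onnx_op_types(path):
    """List the op types of all top-level nodes in an ONNX file."""
    import onnx  # imported lazily so matmul_stats works without the onnx package

    model = onnx.load(path, load_external_data=False)
    return [node.op_type for node in model.graph.node]


if __name__ == "__main__":
    import os

    for name in ("encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"):
        path = os.path.join("whisper-tiny-onnx-int4", name)
        if os.path.exists(path):
            print(name, matmul_stats(onnx_op_types(path)))
```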
**Evaluation of WER**

Evaluate the model's word error rate (WER) on the `librispeech_asr` dataset with the code below:

```python
import os

from datasets import load_dataset
from evaluate import load
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import PretrainedConfig, WhisperProcessor

model_name = 'openai/whisper-tiny'
model_path = 'whisper-tiny-onnx-int4'

processor = WhisperProcessor.from_pretrained(model_name)
model_config = PretrainedConfig.from_pretrained(model_name)
wer = load("wer")
librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

# Create ONNX Runtime sessions for the three INT4 graphs
sessions = ORTModelForSpeechSeq2Seq.load_model(
    os.path.join(model_path, 'encoder_model.onnx'),
    os.path.join(model_path, 'decoder_model.onnx'),
    os.path.join(model_path, 'decoder_with_past_model.onnx'))
model = ORTModelForSpeechSeq2Seq(sessions[0], sessions[1], model_config, model_path, sessions[2])

predictions = []
references = []
for batch in librispeech_test_clean:
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    # Normalize reference and prediction the same way before scoring
    reference = processor.tokenizer._normalize(batch['text'])
    references.append(reference)
    predicted_ids = model.generate(input_features)[0]
    transcription = processor.decode(predicted_ids)
    prediction = processor.tokenizer._normalize(transcription)
    predictions.append(prediction)

wer_result = wer.compute(references=references, predictions=predictions)
print(f"Result wer: {wer_result * 100}")
```

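
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The pure-Python sketch below illustrates that definition at corpus level; it is for intuition only, the function names are hypothetical, and the `wer` metric loaded in the script above remains the implementation used for the reported numbers.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words across all pairs."""
    errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        errors += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return errors / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sit on mat"]
    # One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
    print(f"WER: {word_error_rate(refs, hyps) * 100:.2f}")
```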
## Metrics (Model Performance):

| Model | Model Size (MB) | WER (%) |
|---|:---:|:---:|
| FP32 | 406 | 7.56 |
| INT4 | 326 | 9.94 |
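
The trade-off in the table above can be summarized with a few lines of arithmetic. The snippet uses only the four numbers from the table; the variable names are illustrative.

```python
# Values copied from the metrics table above.
fp32 = {"size_mb": 406, "wer": 7.56}
int4 = {"size_mb": 326, "wer": 9.94}

size_reduction = (fp32["size_mb"] - int4["size_mb"]) / fp32["size_mb"]
wer_increase_abs = int4["wer"] - fp32["wer"]
wer_increase_rel = wer_increase_abs / fp32["wer"]

print(f"Size reduction: {size_reduction:.1%}")  # Size reduction: 19.7%
print(f"WER increase: {wer_increase_abs:.2f} points ({wer_increase_rel:.1%} relative)")
# WER increase: 2.38 points (31.5% relative)
```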