{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Post-Training Quantization of OpenAI Whisper model with NNCF\n",
    "\n",
    "The goal of this tutorial is to demonstrate how to speed up the model by applying 8-bit post-training quantization from [NNCF](https://github.com/openvinotoolkit/nncf/) (Neural Network Compression Framework) and infer quantized model via OpenVINO™ Toolkit. The optimization process contains the following steps:\n",
    "\n",
    "1. Quantize the converted OpenVINO model from [whisper-convert notebook](whisper-convert.ipynb) with NNCF.\n",
    "2. Check model result for the demo video.\n",
    "3. Compare model size, performance and accuracy of FP32 and quantized INT8 models.\n",
    "\n",
    "> **NOTE**: you should run [whisper-convert](whisper-convert.ipynb) notebook first to generate OpenVINO IR model that is used for quantization.\n",
    "\n",
    "\n",
    "#### Table of contents:\n",
    "\n",
    "- [Prerequisites](#Prerequisites)\n",
    "- [Create and initialize quantization](#Create-and-initialize-quantization)\n",
    "    - [Prepare calibration datasets](#Prepare-calibration-datasets)\n",
    "    - [Quantize Whisper encoder and decoder models](#Quantize-Whisper-encoder-and-decoder-models)\n",
    "- [Transcribe video with quantized OpenVINO model](#Transcribe-video-with-quantized-OpenVINO-model)\n",
    "- [Compare performance and accuracy of the FP32 and INT8 IRs](#Compare-performance-and-accuracy-of-the-FP32-and-INT8-IRs)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Install dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:22:38.138055200Z",
     "start_time": "2023-08-24T07:22:38.000326500Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install -q \"openvino>=2023.1.0\"\n",
    "%pip install -q \"nncf>=2.6.0\"\n",
    "%pip install -q datasets librosa soundfile\n",
    "%pip install -q evaluate jiwer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select model for quantization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2e6dde4a5c3344b88772ddc9b5d94510",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Model:', options=('large-v2', 'large-v3'), value='large-v2')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "import ipywidgets as widgets\n",
    "\n",
    "\n",
    "def get_model_id(model_path):\n",
    "    return model_path.name.replace(\"whisper_\", \"\").replace(\"encoder.xml\", \"\").replace(\"_\", \"\")\n",
    "\n",
    "\n",
    "model_list = [get_model_id(model_path) for model_path in Path(\".\").glob(\"whisper_*encoder.xml\")]\n",
    "model_list = [model_name for model_name in model_list if model_name]\n",
    "\n",
    "if not model_list:\n",
    "    raise RuntimeError(\"Please run conversion notebook first\")\n",
    "\n",
    "model_id = widgets.Dropdown(\n",
    "    options=model_list,\n",
    "    value=model_list[0],\n",
    "    description=\"Model:\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "model_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Select device from dropdown list for running inference using OpenVINO."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:22:38.357268200Z",
     "start_time": "2023-08-24T07:22:38.250550900Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "09e511523d2f4cf09655829a27347e61",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', index=2, options=('CPU', 'GPU', 'AUTO'), value='AUTO')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipywidgets as widgets\n",
    "\n",
    "from openvino import Core\n",
    "\n",
    "core = Core()\n",
    "\n",
    "device = widgets.Dropdown(\n",
    "    options=core.available_devices + [\"AUTO\"],\n",
    "    value=\"AUTO\",\n",
    "    description=\"Device:\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "device"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Select the task for the model:\n",
    "\n",
    "* **transcribe** - generate audio transcription in the source language (automatically detected).\n",
    "* **translate** - generate audio transcription with translation to English language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:22:38.359102800Z",
     "start_time": "2023-08-24T07:22:38.357540500Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "83871b40db7e48b693dfca1f39c58eb0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Select(description='Select task:', index=1, options=('transcribe', 'translate'), value='translate')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task = widgets.Select(\n",
    "    options=[\"transcribe\", \"translate\"],\n",
    "    value=\"translate\",\n",
    "    description=\"Select task:\",\n",
    "    disabled=False,\n",
    ")\n",
    "task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Create and initialize quantization\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor. Quantization is the simplest scenario and requires a few modifications.\n",
    "\n",
    "The optimization process contains the following steps:\n",
    "\n",
    "1. Create a calibration dataset for quantization.\n",
    "2. Run `nncf.quantize` to obtain quantized models.\n",
    "3. Serialize the `INT8` model using `openvino.runtime.serialize` function."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Set paths to the model converted in [whisper-convert](whisper-convert.ipynb) notebook and the paths where quantized models will be saved."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "WHISPER_ENCODER_OV = Path(f\"whisper_{model_id.value}_encoder.xml\")\n",
    "WHISPER_DECODER_OV = Path(f\"whisper_{model_id.value}_decoder.xml\")\n",
    "\n",
    "WHISPER_ENCODER_OV_INT8 = Path(f\"whisper_{model_id.value}_encoder_int8.xml\")\n",
    "WHISPER_DECODER_OV_INT8 = Path(f\"whisper_{model_id.value}_decoder_int8.xml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Load FP32 model IR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:22:39.574746400Z",
     "start_time": "2023-08-24T07:22:38.358318Z"
    }
   },
   "outputs": [],
   "source": [
    "import whisper\n",
    "\n",
    "# Fetch `notebook_utils` module\n",
    "import requests\n",
    "\n",
    "r = requests.get(\n",
    "    url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n",
    ")\n",
    "open(\"notebook_utils.py\", \"w\").write(r.text)\n",
    "from notebook_utils import download_file\n",
    "\n",
    "if not Path(\"./utils.py\").exists():\n",
    "    download_file(url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/whisper-subtitles-generation/utils.py\")\n",
    "\n",
    "from utils import (\n",
    "    patch_whisper_for_ov_inference,\n",
    "    OpenVINOAudioEncoder,\n",
    "    OpenVINOTextDecoder,\n",
    ")\n",
    "\n",
    "model_fp32 = whisper.load_model(model_id.value, \"cpu\").eval()\n",
    "patch_whisper_for_ov_inference(model_fp32)\n",
    "\n",
    "model_fp32.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV, device=device.value)\n",
    "model_fp32.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV, device=device.value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Prepare calibration datasets\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Whisper consists of an encoder and a decoder models. We need to collect calibration data for both of them.\n",
    "\n",
    "Below we overwrite encoder/decoder forward methods in order to collect calibration samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:22:39.623947800Z",
     "start_time": "2023-08-24T07:22:39.575286700Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from contextlib import contextmanager\n",
    "from functools import partial\n",
    "import openvino as ov\n",
    "from typing import Optional\n",
    "import torch\n",
    "\n",
    "COLLECT_CALIBRATION_DATA = False\n",
    "encoder_calibration_data = []\n",
    "decoder_calibration_data = []\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def calibration_data_collection():\n",
    "    global COLLECT_CALIBRATION_DATA\n",
    "    try:\n",
    "        COLLECT_CALIBRATION_DATA = True\n",
    "        yield\n",
    "    finally:\n",
    "        COLLECT_CALIBRATION_DATA = False\n",
    "\n",
    "\n",
    "def encoder_forward(self, mel: torch.Tensor):\n",
    "    if COLLECT_CALIBRATION_DATA:\n",
    "        encoder_calibration_data.append(mel)\n",
    "    return torch.from_numpy(self.compiled_model(mel)[self.output_blob])\n",
    "\n",
    "\n",
    "def decoder_forward(self, x: torch.Tensor, xa: torch.Tensor, kv_cache: Optional[dict] = None):\n",
    "    feed_dict = {\"x\": ov.Tensor(x.numpy()), \"xa\": ov.Tensor(xa.numpy())}\n",
    "    feed_dict = self.preprocess_kv_cache_inputs(feed_dict, kv_cache)\n",
    "    if COLLECT_CALIBRATION_DATA:\n",
    "        decoder_calibration_data.append(feed_dict)\n",
    "    res = self.compiled_model(feed_dict)\n",
    "    return self.postprocess_outputs(res)\n",
    "\n",
    "\n",
    "model_fp32.encoder.forward = partial(encoder_forward, model_fp32.encoder)\n",
    "model_fp32.decoder.forward = partial(decoder_forward, model_fp32.decoder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "We use a portion of validation [librispeech_asr](https://huggingface.co/datasets/librispeech_asr) dataset from Hugging Face as calibration data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:23:05.269312200Z",
     "start_time": "2023-08-24T07:22:39.623947800Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "test_replace": {
     "CALIBRATION_DATASET_SIZE = 30": "CALIBRATION_DATASET_SIZE = 1"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "50abbb740a6f49e386c8963bb2cd9b79",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Collecting calibration data:   0%|          | 0/30 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "from tqdm.notebook import tqdm\n",
    "\n",
    "CALIBRATION_DATASET_SIZE = 30\n",
    "\n",
    "calibration_dataset = load_dataset(\"librispeech_asr\", \"clean\", split=\"validation\", streaming=True).take(CALIBRATION_DATASET_SIZE)\n",
    "\n",
    "with calibration_data_collection():\n",
    "    for data_item in tqdm(\n",
    "        calibration_dataset,\n",
    "        desc=\"Collecting calibration data\",\n",
    "        total=CALIBRATION_DATASET_SIZE,\n",
    "    ):\n",
    "        model_fp32.transcribe(data_item[\"audio\"][\"array\"].astype(\"float32\"), task=task.value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Quantize Whisper encoder and decoder models\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Quantize both encoder and decoder models using `nncf.quantize()` API and save the quantized IRs after that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:31.614190800Z",
     "start_time": "2023-08-24T07:23:05.269312200Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n",
      "Quantizing encoder...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [01:42<00:00,  1.72s/it]\n",
      "Applying Smooth Quant: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:13<00:00,  9.71it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:96 ignored nodes was found by name in the NNCFGraph\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [03:17<00:00,  3.29s/it]\n",
      "Applying Fast Bias correction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 162/162 [03:09<00:00,  1.17s/it]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved quantized encoder at ./whisper_large-v2_encoder_int8.xml\n",
      "Quantizing decoder...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Statistics collection: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 669/669 [03:20<00:00,  3.33it/s]\n",
      "Applying Smooth Quant: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 194/194 [00:23<00:00,  8.41it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:192 ignored nodes was found by name in the NNCFGraph\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Statistics collection: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 669/669 [07:22<00:00,  1.51it/s]\n",
      "Applying Fast Bias correction: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [04:01<00:00,  1.06it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved quantized decoder at ./whisper_large-v2_decoder_int8.xml\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "import nncf\n",
    "from openvino.runtime import serialize\n",
    "\n",
    "print(\"Quantizing encoder...\")\n",
    "quantized_encoder = nncf.quantize(\n",
    "    model=model_fp32.encoder.model,\n",
    "    calibration_dataset=nncf.Dataset(encoder_calibration_data),\n",
    "    subset_size=len(encoder_calibration_data),\n",
    "    model_type=nncf.ModelType.TRANSFORMER,\n",
    "    advanced_parameters=nncf.AdvancedQuantizationParameters(\n",
    "        smooth_quant_alpha=0.5  # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\n",
    "    ),\n",
    ")\n",
    "serialize(quantized_encoder, WHISPER_ENCODER_OV_INT8)\n",
    "print(f\"Saved quantized encoder at ./{WHISPER_ENCODER_OV_INT8}\")\n",
    "\n",
    "print(\"Quantizing decoder...\")\n",
    "quantized_decoder = nncf.quantize(\n",
    "    model=model_fp32.decoder.model,\n",
    "    calibration_dataset=nncf.Dataset(decoder_calibration_data),\n",
    "    subset_size=len(decoder_calibration_data),\n",
    "    model_type=nncf.ModelType.TRANSFORMER,\n",
    "    advanced_parameters=nncf.AdvancedQuantizationParameters(\n",
    "        smooth_quant_alpha=0.95  # Smooth Quant algorithm reduces activation quantization error; optimal alpha value was obtained through grid search\n",
    "    ),\n",
    ")\n",
    "serialize(quantized_decoder, WHISPER_DECODER_OV_INT8)\n",
    "print(f\"Saved quantized decoder at ./{WHISPER_DECODER_OV_INT8}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcribe video with quantized OpenVINO model\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Load `INT8` models saved above into a new instance of Whisper model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "model_int8 = whisper.load_model(model_id.value, device=\"cpu\").eval()\n",
    "patch_whisper_for_ov_inference(model_int8)\n",
    "\n",
    "model_int8.encoder = OpenVINOAudioEncoder(core, WHISPER_ENCODER_OV_INT8, device=device.value)\n",
    "model_int8.decoder = OpenVINOTextDecoder(core, WHISPER_DECODER_OV_INT8, device=device.value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Select a video for transcription as in [whisper-convert](whisper-convert.ipynb) notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:33.101022100Z",
     "start_time": "2023-08-24T07:25:33.090494400Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6ad7fe046c4a4d319a0e5430e6e53696",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Text(value='https://youtu.be/kgL5LBM-hFI', description='Video:', placeholder='Type link for video')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "VIDEO_LINK = \"https://youtu.be/kgL5LBM-hFI\"\n",
    "link = widgets.Text(\n",
    "    value=VIDEO_LINK,\n",
    "    placeholder=\"Type link for video\",\n",
    "    description=\"Video:\",\n",
    "    disabled=False,\n",
    ")\n",
    "link"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:33.108346600Z",
     "start_time": "2023-08-24T07:25:33.105926700Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading video https://youtu.be/kgL5LBM-hFI started\n",
      "Video saved to downloaded_video.mp4\n"
     ]
    }
   ],
   "source": [
    "from pytube import YouTube\n",
    "\n",
    "print(f\"Downloading video {link.value} started\")\n",
    "\n",
    "output_file = Path(\"downloaded_video.mp4\")\n",
    "yt = YouTube(link.value)\n",
    "yt.streams.get_highest_resolution().download(filename=output_file)\n",
    "print(f\"Video saved to {output_file}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:36.876565700Z",
     "start_time": "2023-08-24T07:25:33.120502500Z"
    }
   },
   "outputs": [],
   "source": [
    "from utils import get_audio\n",
    "\n",
    "audio, duration = get_audio(output_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Run transcription by the quantized model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:39.288568400Z",
     "start_time": "2023-08-24T07:25:36.875642800Z"
    }
   },
   "outputs": [],
   "source": [
    "transcription = model_int8.transcribe(audio, task=task.value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:39.311444800Z",
     "start_time": "2023-08-24T07:25:39.311444800Z"
    }
   },
   "outputs": [],
   "source": [
    "from utils import prepare_srt\n",
    "\n",
    "srt_lines = prepare_srt(transcription, duration)\n",
    "# save transcription\n",
    "with output_file.with_suffix(\".srt\").open(\"w\") as f:\n",
    "    f.writelines(srt_lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us see the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:39.327567Z",
     "start_time": "2023-08-24T07:25:39.311444800Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "94a897cfe6804c9a99a9764493471338",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Video(value=b\"\\x00\\x00\\x00\\x18ftypmp42\\x00\\x00\\x00\\x00isommp42\\x00\\x00:'moov\\x00\\x00\\x00lmvhd...\", height='800…"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "widgets.Video.from_file(output_file, loop=False, width=800, height=800)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:39.327567Z",
     "start_time": "2023-08-24T07:25:39.327567Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "00:00:00,000 --> 00:00:05,000\n",
      " What's that?\n",
      "\n",
      "2\n",
      "00:00:05,000 --> 00:00:07,000\n",
      " Oh, wow.\n",
      "\n",
      "3\n",
      "00:00:09,000 --> 00:00:11,000\n",
      " Hello, humans.\n",
      "\n",
      "4\n",
      "00:00:13,000 --> 00:00:15,000\n",
      " Focus on me.\n",
      "\n",
      "5\n",
      "00:00:15,000 --> 00:00:17,000\n",
      " Focus on the guard.\n",
      "\n",
      "6\n",
      "00:00:17,000 --> 00:00:20,000\n",
      " Don't tell anyone what you see in here.\n",
      "\n",
      "7\n",
      "00:00:22,000 --> 00:00:24,000\n",
      " Have you seen what's in there?\n",
      "\n",
      "8\n",
      "00:00:24,000 --> 00:00:25,000\n",
      " They have...\n",
      "\n",
      "9\n",
      "00:00:25,000 --> 00:00:27,000\n",
      " Intel. This is where it all changes.\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"\".join(srt_lines))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "As you can see the result is almost the same."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "## Compare performance and accuracy of the FP32 and INT8 IRs\n",
    "[back to top ⬆️](#Table-of-contents:)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "Compare model file size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:25:39.370126700Z",
     "start_time": "2023-08-24T07:25:39.336253900Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model: whisper_large-v2_encoder\n",
      "    * FP32 IR model size: 1244080.07 KB\n",
      "    * INT8 IR model size: 626971.58 KB\n",
      "    * Model compression rate: 1.984\n",
      "Model: whisper_large-v2_decoder\n",
      "    * FP32 IR model size: 1900607.09 KB\n",
      "    * INT8 IR model size: 955679.81 KB\n",
      "    * Model compression rate: 1.989\n"
     ]
    }
   ],
   "source": [
    "def calculate_compression_rate(model_path_ov, model_path_ov_int8):\n",
    "    model_size_fp32 = model_path_ov.with_suffix(\".bin\").stat().st_size / 1024\n",
    "    model_size_int8 = model_path_ov_int8.with_suffix(\".bin\").stat().st_size / 1024\n",
    "    print(f\"Model: {model_path_ov.stem}\")\n",
    "    print(f\"    * FP32 IR model size: {model_size_fp32:.2f} KB\")\n",
    "    print(f\"    * INT8 IR model size: {model_size_int8:.2f} KB\")\n",
    "    print(f\"    * Model compression rate: {model_size_fp32 / model_size_int8:.3f}\")\n",
    "\n",
    "\n",
    "calculate_compression_rate(WHISPER_ENCODER_OV, WHISPER_ENCODER_OV_INT8)\n",
    "calculate_compression_rate(WHISPER_DECODER_OV, WHISPER_DECODER_OV_INT8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "To measure the inference performance of the `FP32` and `INT8` encoder/decoder models, we use median inference time on calibration dataset.\n",
    "So we can approximately estimate the speed-up of the dynamic quantized models.\n",
    "\n",
    "> **NOTE**: For the most accurate performance estimation, it is recommended to run `benchmark_app` with static shapes in a terminal/command prompt after closing other applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-08-24T07:26:39.712460600Z",
     "start_time": "2023-08-24T07:25:39.370126700Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "39707d4e1bed4ca3a56aa05b6158c124",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance:   0%|          | 0/60 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7c2062fd20c24ae2802d4a74dfc8f030",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance:   0%|          | 0/60 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Encoder performance speedup: 1.763\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b054c885ee9846ffaa632e88209e8599",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance:   0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "03f7c23af8f24d3e877eb68fc9551e69",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance:   0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Decoder performance speedup: 2.022\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def calculate_call_inference_time(model, dataset):\n",
    "    inference_time = []\n",
    "    for data_item in tqdm(dataset[:100], desc=\"Measuring performance\"):\n",
    "        start = time.perf_counter()\n",
    "        model(data_item)\n",
    "        end = time.perf_counter()\n",
    "        delta = end - start\n",
    "        inference_time.append(delta)\n",
    "    return np.median(inference_time)\n",
    "\n",
    "\n",
    "encoder_time_fp32 = calculate_call_inference_time(model_fp32.encoder.compiled_model, encoder_calibration_data)\n",
    "encoder_time_int8 = calculate_call_inference_time(model_int8.encoder.compiled_model, encoder_calibration_data)\n",
    "print(f\"Encoder performance speedup: {encoder_time_fp32 / encoder_time_int8:.3f}\")\n",
    "\n",
    "decoder_time_fp32 = calculate_call_inference_time(model_fp32.decoder.compiled_model, decoder_calibration_data)\n",
    "decoder_time_int8 = calculate_call_inference_time(model_int8.decoder.compiled_model, decoder_calibration_data)\n",
    "print(f\"Decoder performance speedup: {decoder_time_fp32 / decoder_time_int8:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "We measure the whole transcription performance separately, because a single Whisper `transcribe()` call triggers multiple encoder and decoder inference calls.\n",
    "And the number of these calls is dynamic depending on the model accuracy.\n",
    "In this experiment we use the mean time instead of the median because the model transcription time is less uniform.\n",
    "\n",
    "We also compare accuracy values of the `FP32` and `INT8` models on a subset of [librispeech_asr](https://huggingface.co/datasets/librispeech_asr) test dataset.\n",
    "We rely on the Word Error Rate (WER) metric and compute accuracy as `(1 - WER)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "test_replace": {
     "TEST_DATASET_SIZE = 100": "TEST_DATASET_SIZE = 1"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "96b540d63cbf48eca2a28b485e82794d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance and accuracy:   0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "cbe2031ad02d42789059b537deffaf39",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Measuring performance and accuracy:   0%|          | 0/100 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Whisper transcription performance speedup: 1.799\n",
      "Whisper transcription word accuracy. FP32: 98.41%. INT8: 97.51%. Accuracy drop :0.90%.\n"
     ]
    }
   ],
   "source": [
    "from evaluate import load\n",
    "from transformers import WhisperProcessor\n",
    "\n",
    "wer = load(\"wer\")\n",
    "\n",
    "TEST_DATASET_SIZE = 100\n",
    "test_dataset = load_dataset(\"librispeech_asr\", \"clean\", split=\"test\", streaming=True).take(TEST_DATASET_SIZE)\n",
    "\n",
    "\n",
    "def calculate_transcription_time_and_accuracy(model, dataset):\n",
    "    processor = WhisperProcessor.from_pretrained(\"openai/whisper-large\")\n",
    "\n",
    "    ground_truths = []\n",
    "    predictions = []\n",
    "    inference_time = []\n",
    "    for data_item in tqdm(dataset, desc=\"Measuring performance and accuracy\", total=TEST_DATASET_SIZE):\n",
    "        audio = data_item[\"audio\"][\"array\"].astype(\"float32\")\n",
    "\n",
    "        start_time = time.perf_counter()\n",
    "        transcription = model.transcribe(audio, task=task.value)\n",
    "        end_time = time.perf_counter()\n",
    "        delta_time = end_time - start_time\n",
    "\n",
    "        reference = processor.tokenizer._normalize(data_item[\"text\"])\n",
    "        prediction = processor.tokenizer._normalize(transcription[\"text\"])\n",
    "        ground_truths.append(reference)\n",
    "        predictions.append(prediction)\n",
    "        inference_time.append(delta_time)\n",
    "\n",
    "    word_accuracy = (1 - wer.compute(references=ground_truths, predictions=predictions)) * 100\n",
    "    mean_inference_time = np.mean(inference_time)\n",
    "    return mean_inference_time, word_accuracy\n",
    "\n",
    "\n",
    "transcription_time_fp32, accuracy_fp32 = calculate_transcription_time_and_accuracy(model_fp32, test_dataset)\n",
    "transcription_time_int8, accuracy_int8 = calculate_transcription_time_and_accuracy(model_int8, test_dataset)\n",
    "print(f\"Whisper transcription performance speedup: {transcription_time_fp32 / transcription_time_int8:.3f}\")\n",
    "print(f\"Whisper transcription word accuracy. FP32: {accuracy_fp32:.2f}%. INT8: {accuracy_int8:.2f}%. Accuracy drop :{accuracy_fp32 - accuracy_int8:.2f}%.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "> **NOTE**: Accuracy drop can generally be improved by increasing calibration dataset size."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "openvino_notebooks": {
   "imageUrl": "https://user-images.githubusercontent.com/29454499/204548693-1304ef33-c790-490d-8a8b-d5766acb6254.png",
   "tags": {
    "categories": [
     "Model Demos",
     "AI Trends"
    ],
    "libraries": [],
    "other": [],
    "tasks": [
     "Speech Recognition"
    ]
   }
  },
  "vscode": {
   "interpreter": {
    "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {
     "2e6dde4a5c3344b88772ddc9b5d94510": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DropdownModel",
      "state": {
       "_options_labels": [
        "large-v2",
        "large-v3"
       ],
       "description": "Model:",
       "index": 0,
       "layout": "IPY_MODEL_b01054569522434d865aed9f2e2c0f83",
       "style": "IPY_MODEL_68ab334740754f9591d40c5bfdad8128"
      }
     },
     "68ab334740754f9591d40c5bfdad8128": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DescriptionStyleModel",
      "state": {
       "description_width": ""
      }
     },
     "b01054569522434d865aed9f2e2c0f83": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     }
    },
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}