{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "c87087c91122e3f8", "metadata": { "collapsed": false }, "source": [ "# MMS: Scaling Speech Technology to 1000+ languages with OpenVINO™\n", "\n", "The Massively Multilingual Speech (MMS) project expands speech technology from about 100 languages to over 1,000 by building a single multilingual speech recognition model supporting over 1,100 languages (more than 10 times as many as before), language identification models able to identify over 4,000 languages (40 times more than before), pretrained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages.\n", "\n", "The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516). The models and code were originally released [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms).\n", "\n", "The MMS project provides several open-source models: Automatic Speech Recognition (ASR), Language Identification (LID) and Speech Synthesis (TTS). The diagram below shows how the LID and ASR models work together.\n", "\n", "![LID and ASR flow](https://github.com/openvinotoolkit/openvino_notebooks/assets/76171391/0e7fadd6-29a8-4fac-bd9c-41d66adcb045)\n", "\n", "This notebook covers ASR and LID. We use the LID model to identify the spoken language and then a language-specific ASR model to transcribe the speech. An additional quantization step is applied to both models to improve inference speed. At the end of the notebook there is a Gradio-based interactive demo."
] }, { "attachments": {}, "cell_type": "markdown", "id": "fa80166a11177e7a", "metadata": { "collapsed": false }, "source": [ "\n", "#### Table of contents:\n", "\n", "- [Prerequisites](#Prerequisites)\n", "- [Prepare an example audio](#Prepare-an-example-audio)\n", "- [Language Identification (LID)](#Language-Identification-(LID))\n", " - [Download pretrained model and processor](#Download-pretrained-model-and-processor)\n", " - [Use the original model to run an inference](#Use-the-original-model-to-run-an-inference)\n", " - [Convert to OpenVINO IR model and run an inference](#Convert-to-OpenVINO-IR-model-and-run-an-inference)\n", "- [Automatic Speech Recognition (ASR)](#Automatic-Speech-Recognition-(ASR))\n", " - [Download pretrained model and processor](#Download-pretrained-model-and-processor)\n", " - [Use the original model for inference](#Use-the-original-model-for-inference)\n", " - [Convert to OpenVINO IR model and run inference](#Convert-to-OpenVINO-IR-model-and-run-inference)\n", "- [Quantization](#Quantization)\n", " - [Preparing calibration dataset](#Preparing-calibration-dataset)\n", " - [Language identification model quantization](#Language-identification-model-quantization)\n", " - [Speech recognition model quantization](#Speech-recognition-model-quantization)\n", " - [Compare model size, performance and accuracy](#Compare-model-size,-performance-and-accuracy)\n", "- [Interactive demo with Gradio](#Interactive-demo-with-Gradio)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "90c7a208b1fa497b", "metadata": { "collapsed": false }, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "bc1a0304b8213aa", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:47.440197100Z", "start_time": "2023-10-12T15:54:46.774028500Z" }, "collapsed": false }, "outputs": [], "source": [ "%pip install -q --upgrade pip\n", "%pip install -q \"transformers>=4.33.1\" 
\"torch>=2.1\" \"openvino>=2023.1.0\" \"numpy>=1.21.0\" \"nncf>=2.9.0\"\n", "%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu torch \"datasets>=2.14.6\" accelerate soundfile librosa \"gradio>=4.19\" jiwer" ] }, { "cell_type": "code", "execution_count": 2, "id": "dbac6fae86122d9b", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:47.591931700Z", "start_time": "2023-10-12T15:54:46.786966800Z" }, "collapsed": false }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import torch\n", "\n", "import openvino as ov" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8d81ab16ec40431a", "metadata": { "collapsed": false }, "source": [ "## Prepare an example audio\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Read an audio file and process the audio data. Make sure that the audio data is sampled at 16 kHz (16,000 Hz), the rate the MMS models expect.\n", "For this example we will use [a streamable version of the Multilingual LibriSpeech (MLS) dataset](https://huggingface.co/datasets/multilingual_librispeech). It contains examples in 7 languages: `'german', 'dutch', 'french', 'spanish', 'italian', 'portuguese', 'polish'`.\n", "Choose one of them."
] }, { "cell_type": "code", "execution_count": 3, "id": "d46064f030034ef0", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:47.591931700Z", "start_time": "2023-10-12T15:54:47.575834800Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2bddaa41494c4bc8aaaf7bb3d3c394a4", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Dataset language:', options=('german', 'dutch', 'french', 'spanish', 'italian', 'portugu…" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ipywidgets as widgets\n", "\n", "\n", "SAMPLE_LANG = widgets.Dropdown(\n", "    options=[\"german\", \"dutch\", \"french\", \"spanish\", \"italian\", \"portuguese\", \"polish\"],\n", "    value=\"german\",\n", "    description=\"Dataset language:\",\n", "    disabled=False,\n", ")\n", "\n", "SAMPLE_LANG" ] }, { "attachments": {}, "cell_type": "markdown", "id": "62f4f25bd4987849", "metadata": { "collapsed": false }, "source": [ "Specify `streaming=True` to avoid downloading the entire dataset." ] }, { "cell_type": "code", "execution_count": 4, "id": "3e3b30952e08ee76", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:53.101990700Z", "start_time": "2023-10-12T15:54:47.575834800Z" }, "collapsed": false }, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "\n", "mls_dataset = load_dataset(\"facebook/multilingual_librispeech\", SAMPLE_LANG.value, split=\"test\", streaming=True)\n", "mls_dataset = iter(mls_dataset)  # create an iterator over the streamed dataset\n", "\n", "example = next(mls_dataset)  # get one example" ] }, { "attachments": {}, "cell_type": "markdown", "id": "68f9bb826d9a36dd", "metadata": { "collapsed": false }, "source": [ "Each example is a dictionary that contains the audio data and its text transcription."
] }, { "cell_type": "code", "execution_count": 5, "id": "53d4ee3f9e30aacf", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:53.106498900Z", "start_time": "2023-10-12T15:54:53.101990700Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'file': None, 'audio': {'path': '1054_1599_000000.flac', 'array': array([-0.00131226, -0.00152588, -0.00134277, ..., 0.00411987,\n", " 0.00308228, -0.00015259]), 'sampling_rate': 16000}, 'text': 'mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit', 'speaker_id': 1054, 'chapter_id': 1599, 'id': '1054_1599_000000'}\n" ] } ], "source": [ "print(example) # look at structure" ] }, { "cell_type": "code", "execution_count": 6, "id": "5f96bfecfd4bab51", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:53.320425400Z", "start_time": "2023-10-12T15:54:53.106498900Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mein sechster sohn scheint wenigstens auf den ersten blick der tiefsinnigste von allen ein kopfhänger und doch ein schwätzer deshalb kommt man ihm nicht leicht bei ist er am unterliegen so verfällt er in unbesiegbare traurigkeit\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "print(example[\"text\"])\n", "ipd.Audio(example[\"audio\"][\"array\"], rate=16_000)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "86963727a1d32e5a", "metadata": { "collapsed": false }, "source": [ "## Language Identification (LID) \n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cb607febc51e3782", "metadata": { "collapsed": false }, "source": [ "### Download pretrained 
model and processor\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Different LID models are available based on the number of languages they can recognize - 126, 256, 512, 1024, 2048, 4017. We will use 126." ] }, { "cell_type": "code", "execution_count": 7, "id": "1995f9336132be61", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:54:59.110836600Z", "start_time": "2023-10-12T15:54:53.294937500Z" }, "collapsed": false }, "outputs": [], "source": [ "from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor\n", "\n", "model_id = \"facebook/mms-lid-126\"\n", "\n", "lid_processor = AutoFeatureExtractor.from_pretrained(model_id)\n", "lid_model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "100d4f9dfff9a7d3", "metadata": { "collapsed": false }, "source": [ "### Use the original model to run an inference\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "ef184f78ef5f39c0", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:02.814861200Z", "start_time": "2023-10-12T15:54:59.111671500Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "deu\n" ] } ], "source": [ "inputs = lid_processor(example[\"audio\"][\"array\"], sampling_rate=16_000, return_tensors=\"pt\")\n", "\n", "with torch.no_grad():\n", " outputs = lid_model(**inputs).logits\n", "\n", "lang_id = torch.argmax(outputs, dim=-1)[0].item()\n", "detected_lang = lid_model.config.id2label[lang_id]\n", "print(detected_lang)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9bc6f53041bf77e4", "metadata": { "collapsed": false }, "source": [ "### Convert to OpenVINO IR model and run an inference\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2fb627d3", "metadata": { "collapsed": false }, "source": [ "Select device from dropdown list for running 
inference using OpenVINO" ] }, { "cell_type": "code", "execution_count": 9, "id": "a71adf13", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:02.914590700Z", "start_time": "2023-10-12T15:55:02.908879300Z" }, "collapsed": false }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "de3a5c59a5d34c72accfc6f4a87bacae", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "core = ov.Core()\n", "\n", "device = widgets.Dropdown(\n", " options=core.available_devices + [\"AUTO\"],\n", " value=\"AUTO\",\n", " description=\"Device:\",\n", " disabled=False,\n", ")\n", "\n", "device" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ca15564e", "metadata": { "collapsed": false }, "source": [ "Convert model to OpenVINO format and compile it" ] }, { "cell_type": "code", "execution_count": 10, "id": "c79ba406", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:12.102555300Z", "start_time": "2023-10-12T15:55:02.924532500Z" }, "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:595: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", " if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n", "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:634: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. 
We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", " if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n" ] } ], "source": [ "MAX_SEQ_LENGTH = 30480\n", "\n", "lid_model_xml_path = Path(\"models/ov_lid_model.xml\")\n", "\n", "\n", "def get_lid_model(model_path, compiled=True):\n", "    # Dummy input used only to trace the PyTorch model during conversion\n", "    input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)\n", "\n", "    # Convert and save the model only if the IR file does not exist yet\n", "    if not model_path.exists() and model_path == lid_model_xml_path:\n", "        lid_model_xml_path.parent.mkdir(parents=True, exist_ok=True)\n", "        converted_model = ov.convert_model(lid_model, example_input={\"input_values\": input_values})\n", "        ov.save_model(converted_model, lid_model_xml_path)\n", "        if not compiled:\n", "            return converted_model\n", "    # Otherwise load the saved IR from disk, compiled for the selected device if requested\n", "    if compiled:\n", "        return core.compile_model(model_path, device_name=device.value)\n", "    return core.read_model(model_path)\n", "\n", "\n", "compiled_lid_model = get_lid_model(lid_model_xml_path)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "40193d2a396bb746", "metadata": { "collapsed": false }, "source": [ "Now it is possible to run an inference.
" ] }, { "cell_type": "code", "execution_count": 11, "id": "a5d96a19f0504f3d", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:12.119092Z", "start_time": "2023-10-12T15:55:12.119092Z" }, "collapsed": false }, "outputs": [], "source": [ "def detect_language(compiled_model, audio_data):\n", " inputs = lid_processor(audio_data, sampling_rate=16_000, return_tensors=\"pt\")\n", "\n", " outputs = compiled_model(inputs[\"input_values\"])[0]\n", "\n", " lang_id = torch.argmax(torch.from_numpy(outputs), dim=-1)[0].item()\n", " detected_lang = lid_model.config.id2label[lang_id]\n", "\n", " return detected_lang" ] }, { "cell_type": "code", "execution_count": 12, "id": "dcaae46ecd2077b8", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:13.705838100Z", "start_time": "2023-10-12T15:55:12.119092Z" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'deu'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "detect_language(compiled_lid_model, example[\"audio\"][\"array\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "346a0954d96d40df", "metadata": { "collapsed": false }, "source": [ "Let's check another language." 
] }, { "cell_type": "code", "execution_count": 13, "id": "8e89c90f-d6f0-4dc6-ba3d-34a4be710a44", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:13.721895900Z", "start_time": "2023-10-12T15:55:13.712315400Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f2511e43d090485998c2ac350e715a7a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Dataset language:', index=2, options=('german', 'dutch', 'french', 'spanish', 'italian',…" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SAMPLE_LANG = widgets.Dropdown(\n", " options=[\"german\", \"dutch\", \"french\", \"spanish\", \"italian\", \"portuguese\", \"polish\"],\n", " value=\"french\",\n", " description=\"Dataset language:\",\n", " disabled=False,\n", ")\n", "\n", "SAMPLE_LANG" ] }, { "cell_type": "code", "execution_count": 14, "id": "7e4b10f76be235ad", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:17.815597900Z", "start_time": "2023-10-12T15:55:13.721895900Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mls_dataset = load_dataset(\"facebook/multilingual_librispeech\", SAMPLE_LANG.value, split=\"test\", streaming=True)\n", "mls_dataset = iter(mls_dataset)\n", "\n", "example = next(mls_dataset)\n", "print(example[\"text\"])\n", "ipd.Audio(example[\"audio\"][\"array\"], rate=16_000)" ] }, { "cell_type": "code", "execution_count": 15, "id": "67f764403640f618", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:18.506184200Z", "start_time": "2023-10-12T15:55:17.815597900Z" }, 
"collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fra\n" ] } ], "source": [ "language_id = detect_language(compiled_lid_model, example[\"audio\"][\"array\"])\n", "print(language_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e010ed384d1e8ee7", "metadata": { "collapsed": false }, "source": [ "## Automatic Speech Recognition (ASR)\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fe4536f63fe7e612", "metadata": { "collapsed": false }, "source": [ "### Download pretrained model and processor\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Download pretrained model and processor. By default, MMS loads adapter weights for English. If you want to load adapter weights of another language make sure to specify `target_lang=` as well as `ignore_mismatched_sizes=True`. The `ignore_mismatched_sizes=True` keyword has to be passed to allow the language model head to be resized according to the vocabulary of the specified language. Similarly, the processor should be loaded with the same target language. \n", "It is also possible to change the supported language later." 
] }, { "cell_type": "code", "execution_count": 16, "id": "2b104f835667fb9a", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:24.840244900Z", "start_time": "2023-10-12T15:55:18.506184200Z" } }, "outputs": [], "source": [ "from transformers import Wav2Vec2ForCTC, AutoProcessor\n", "\n", "model_id = \"facebook/mms-1b-all\"\n", "\n", "asr_processor = AutoProcessor.from_pretrained(model_id)\n", "asr_model = Wav2Vec2ForCTC.from_pretrained(model_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5896f5fd08f62071", "metadata": { "collapsed": false }, "source": [ "You can look at all supported languages:" ] }, { "cell_type": "code", "execution_count": 17, "id": "6b62341511f98ceb", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:24.860305100Z", "start_time": "2023-10-12T15:55:24.845930900Z" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['abi', 'abk', 'abp', 'aca', 'acd', 'ace', 'acf', 'ach', 'acn', 'acr', 'acu', 'ade', 'adh', 'adj', 'adx', 'aeu', 'afr', 'agd', 'agg', 'agn', 'agr', 'agu', 'agx', 'aha', 'ahk', 'aia', 'aka', 'akb', 'ake', 'akp', 'alj', 'alp', 'alt', 'alz', 'ame', 'amf', 'amh', 'ami', 'amk', 'ann', 'any', 'aoz', 'apb', 'apr', 'ara', 'arl', 'asa', 'asg', 'asm', 'ast', 'ata', 'atb', 'atg', 'ati', 'atq', 'ava', 'avn', 'avu', 'awa', 'awb', 'ayo', 'ayr', 'ayz', 'azb', 'azg', 'azj-script_cyrillic', 'azj-script_latin', 'azz', 'bak', 'bam', 'ban', 'bao', 'bas', 'bav', 'bba', 'bbb', 'bbc', 'bbo', 'bcc-script_arabic', 'bcc-script_latin', 'bcl', 'bcw', 'bdg', 'bdh', 'bdq', 'bdu', 'bdv', 'beh', 'bel', 'bem', 'ben', 'bep', 'bex', 'bfa', 'bfo', 'bfy', 'bfz', 'bgc', 'bgq', 'bgr', 'bgt', 'bgw', 'bha', 'bht', 'bhz', 'bib', 'bim', 'bis', 'biv', 'bjr', 'bjv', 'bjw', 'bjz', 'bkd', 'bkv', 'blh', 'blt', 'blx', 'blz', 'bmq', 'bmr', 'bmu', 'bmv', 'bng', 'bno', 'bnp', 'boa', 'bod', 'boj', 'bom', 'bor', 'bos', 'bov', 'box', 'bpr', 'bps', 'bqc', 'bqi', 'bqj', 'bqp', 'bre', 'bru', 'bsc', 'bsq', 'bss', 'btd', 'bts', 'btt', 
'btx', 'bud', 'bul', 'bus', 'bvc', 'bvz', 'bwq', 'bwu', 'byr', 'bzh', 'bzi', 'bzj', 'caa', 'cab', 'cac-dialect_sanmateoixtatan', 'cac-dialect_sansebastiancoatan', 'cak-dialect_central', 'cak-dialect_santamariadejesus', 'cak-dialect_santodomingoxenacoj', 'cak-dialect_southcentral', 'cak-dialect_western', 'cak-dialect_yepocapa', 'cap', 'car', 'cas', 'cat', 'cax', 'cbc', 'cbi', 'cbr', 'cbs', 'cbt', 'cbu', 'cbv', 'cce', 'cco', 'cdj', 'ceb', 'ceg', 'cek', 'ces', 'cfm', 'cgc', 'che', 'chf', 'chv', 'chz', 'cjo', 'cjp', 'cjs', 'ckb', 'cko', 'ckt', 'cla', 'cle', 'cly', 'cme', 'cmn-script_simplified', 'cmo-script_khmer', 'cmo-script_latin', 'cmr', 'cnh', 'cni', 'cnl', 'cnt', 'coe', 'cof', 'cok', 'con', 'cot', 'cou', 'cpa', 'cpb', 'cpu', 'crh', 'crk-script_latin', 'crk-script_syllabics', 'crn', 'crq', 'crs', 'crt', 'csk', 'cso', 'ctd', 'ctg', 'cto', 'ctu', 'cuc', 'cui', 'cuk', 'cul', 'cwa', 'cwe', 'cwt', 'cya', 'cym', 'daa', 'dah', 'dan', 'dar', 'dbj', 'dbq', 'ddn', 'ded', 'des', 'deu', 'dga', 'dgi', 'dgk', 'dgo', 'dgr', 'dhi', 'did', 'dig', 'dik', 'dip', 'div', 'djk', 'dnj-dialect_blowowest', 'dnj-dialect_gweetaawueast', 'dnt', 'dnw', 'dop', 'dos', 'dsh', 'dso', 'dtp', 'dts', 'dug', 'dwr', 'dyi', 'dyo', 'dyu', 'dzo', 'eip', 'eka', 'ell', 'emp', 'enb', 'eng', 'enx', 'epo', 'ese', 'ess', 'est', 'eus', 'evn', 'ewe', 'eza', 'fal', 'fao', 'far', 'fas', 'fij', 'fin', 'flr', 'fmu', 'fon', 'fra', 'frd', 'fry', 'ful', 'gag-script_cyrillic', 'gag-script_latin', 'gai', 'gam', 'gau', 'gbi', 'gbk', 'gbm', 'gbo', 'gde', 'geb', 'gej', 'gil', 'gjn', 'gkn', 'gld', 'gle', 'glg', 'glk', 'gmv', 'gna', 'gnd', 'gng', 'gof-script_latin', 'gog', 'gor', 'gqr', 'grc', 'gri', 'grn', 'grt', 'gso', 'gub', 'guc', 'gud', 'guh', 'guj', 'guk', 'gum', 'guo', 'guq', 'guu', 'gux', 'gvc', 'gvl', 'gwi', 'gwr', 'gym', 'gyr', 'had', 'hag', 'hak', 'hap', 'hat', 'hau', 'hay', 'heb', 'heh', 'hif', 'hig', 'hil', 'hin', 'hlb', 'hlt', 'hne', 'hnn', 'hns', 'hoc', 'hoy', 'hrv', 'hsb', 'hto', 'hub', 'hui', 'hun', 
'hus-dialect_centralveracruz', 'hus-dialect_westernpotosino', 'huu', 'huv', 'hvn', 'hwc', 'hye', 'hyw', 'iba', 'ibo', 'icr', 'idd', 'ifa', 'ifb', 'ife', 'ifk', 'ifu', 'ify', 'ign', 'ikk', 'ilb', 'ilo', 'imo', 'ina', 'inb', 'ind', 'iou', 'ipi', 'iqw', 'iri', 'irk', 'isl', 'ita', 'itl', 'itv', 'ixl-dialect_sangasparchajul', 'ixl-dialect_sanjuancotzal', 'ixl-dialect_santamarianebaj', 'izr', 'izz', 'jac', 'jam', 'jav', 'jbu', 'jen', 'jic', 'jiv', 'jmc', 'jmd', 'jpn', 'jun', 'juy', 'jvn', 'kaa', 'kab', 'kac', 'kak', 'kam', 'kan', 'kao', 'kaq', 'kat', 'kay', 'kaz', 'kbo', 'kbp', 'kbq', 'kbr', 'kby', 'kca', 'kcg', 'kdc', 'kde', 'kdh', 'kdi', 'kdj', 'kdl', 'kdn', 'kdt', 'kea', 'kek', 'ken', 'keo', 'ker', 'key', 'kez', 'kfb', 'kff-script_telugu', 'kfw', 'kfx', 'khg', 'khm', 'khq', 'kia', 'kij', 'kik', 'kin', 'kir', 'kjb', 'kje', 'kjg', 'kjh', 'kki', 'kkj', 'kle', 'klu', 'klv', 'klw', 'kma', 'kmd', 'kml', 'kmr-script_arabic', 'kmr-script_cyrillic', 'kmr-script_latin', 'kmu', 'knb', 'kne', 'knf', 'knj', 'knk', 'kno', 'kog', 'kor', 'kpq', 'kps', 'kpv', 'kpy', 'kpz', 'kqe', 'kqp', 'kqr', 'kqy', 'krc', 'kri', 'krj', 'krl', 'krr', 'krs', 'kru', 'ksb', 'ksr', 'kss', 'ktb', 'ktj', 'kub', 'kue', 'kum', 'kus', 'kvn', 'kvw', 'kwd', 'kwf', 'kwi', 'kxc', 'kxf', 'kxm', 'kxv', 'kyb', 'kyc', 'kyf', 'kyg', 'kyo', 'kyq', 'kyu', 'kyz', 'kzf', 'lac', 'laj', 'lam', 'lao', 'las', 'lat', 'lav', 'law', 'lbj', 'lbw', 'lcp', 'lee', 'lef', 'lem', 'lew', 'lex', 'lgg', 'lgl', 'lhu', 'lia', 'lid', 'lif', 'lin', 'lip', 'lis', 'lit', 'lje', 'ljp', 'llg', 'lln', 'lme', 'lnd', 'lns', 'lob', 'lok', 'lom', 'lon', 'loq', 'lsi', 'lsm', 'ltz', 'luc', 'lug', 'luo', 'lwo', 'lww', 'lzz', 'maa-dialect_sanantonio', 'maa-dialect_sanjeronimo', 'mad', 'mag', 'mah', 'mai', 'maj', 'mak', 'mal', 'mam-dialect_central', 'mam-dialect_northern', 'mam-dialect_southern', 'mam-dialect_western', 'maq', 'mar', 'maw', 'maz', 'mbb', 'mbc', 'mbh', 'mbj', 'mbt', 'mbu', 'mbz', 'mca', 'mcb', 'mcd', 'mco', 'mcp', 'mcq', 'mcu', 'mda', 
'mdf', 'mdv', 'mdy', 'med', 'mee', 'mej', 'men', 'meq', 'met', 'mev', 'mfe', 'mfh', 'mfi', 'mfk', 'mfq', 'mfy', 'mfz', 'mgd', 'mge', 'mgh', 'mgo', 'mhi', 'mhr', 'mhu', 'mhx', 'mhy', 'mib', 'mie', 'mif', 'mih', 'mil', 'mim', 'min', 'mio', 'mip', 'miq', 'mit', 'miy', 'miz', 'mjl', 'mjv', 'mkd', 'mkl', 'mkn', 'mlg', 'mlt', 'mmg', 'mnb', 'mnf', 'mnk', 'mnw', 'mnx', 'moa', 'mog', 'mon', 'mop', 'mor', 'mos', 'mox', 'moz', 'mpg', 'mpm', 'mpp', 'mpx', 'mqb', 'mqf', 'mqj', 'mqn', 'mri', 'mrw', 'msy', 'mtd', 'mtj', 'mto', 'muh', 'mup', 'mur', 'muv', 'muy', 'mvp', 'mwq', 'mwv', 'mxb', 'mxq', 'mxt', 'mxv', 'mya', 'myb', 'myk', 'myl', 'myv', 'myx', 'myy', 'mza', 'mzi', 'mzj', 'mzk', 'mzm', 'mzw', 'nab', 'nag', 'nan', 'nas', 'naw', 'nca', 'nch', 'ncj', 'ncl', 'ncu', 'ndj', 'ndp', 'ndv', 'ndy', 'ndz', 'neb', 'new', 'nfa', 'nfr', 'nga', 'ngl', 'ngp', 'ngu', 'nhe', 'nhi', 'nhu', 'nhw', 'nhx', 'nhy', 'nia', 'nij', 'nim', 'nin', 'nko', 'nlc', 'nld', 'nlg', 'nlk', 'nmz', 'nnb', 'nno', 'nnq', 'nnw', 'noa', 'nob', 'nod', 'nog', 'not', 'npi', 'npl', 'npy', 'nso', 'nst', 'nsu', 'ntm', 'ntr', 'nuj', 'nus', 'nuz', 'nwb', 'nxq', 'nya', 'nyf', 'nyn', 'nyo', 'nyy', 'nzi', 'obo', 'oci', 'ojb-script_latin', 'ojb-script_syllabics', 'oku', 'old', 'omw', 'onb', 'ood', 'orm', 'ory', 'oss', 'ote', 'otq', 'ozm', 'pab', 'pad', 'pag', 'pam', 'pan', 'pao', 'pap', 'pau', 'pbb', 'pbc', 'pbi', 'pce', 'pcm', 'peg', 'pez', 'pib', 'pil', 'pir', 'pis', 'pjt', 'pkb', 'pls', 'plw', 'pmf', 'pny', 'poh-dialect_eastern', 'poh-dialect_western', 'poi', 'pol', 'por', 'poy', 'ppk', 'pps', 'prf', 'prk', 'prt', 'pse', 'pss', 'ptu', 'pui', 'pus', 'pwg', 'pww', 'pxm', 'qub', 'quc-dialect_central', 'quc-dialect_east', 'quc-dialect_north', 'quf', 'quh', 'qul', 'quw', 'quy', 'quz', 'qvc', 'qve', 'qvh', 'qvm', 'qvn', 'qvo', 'qvs', 'qvw', 'qvz', 'qwh', 'qxh', 'qxl', 'qxn', 'qxo', 'qxr', 'rah', 'rai', 'rap', 'rav', 'raw', 'rej', 'rel', 'rgu', 'rhg', 'rif-script_arabic', 'rif-script_latin', 'ril', 'rim', 'rjs', 'rkt', 
'rmc-script_cyrillic', 'rmc-script_latin', 'rmo', 'rmy-script_cyrillic', 'rmy-script_latin', 'rng', 'rnl', 'roh-dialect_sursilv', 'roh-dialect_vallader', 'rol', 'ron', 'rop', 'rro', 'rub', 'ruf', 'rug', 'run', 'rus', 'sab', 'sag', 'sah', 'saj', 'saq', 'sas', 'sat', 'sba', 'sbd', 'sbl', 'sbp', 'sch', 'sck', 'sda', 'sea', 'seh', 'ses', 'sey', 'sgb', 'sgj', 'sgw', 'shi', 'shk', 'shn', 'sho', 'shp', 'sid', 'sig', 'sil', 'sja', 'sjm', 'sld', 'slk', 'slu', 'slv', 'sml', 'smo', 'sna', 'snd', 'sne', 'snn', 'snp', 'snw', 'som', 'soy', 'spa', 'spp', 'spy', 'sqi', 'sri', 'srm', 'srn', 'srp-script_cyrillic', 'srp-script_latin', 'srx', 'stn', 'stp', 'suc', 'suk', 'sun', 'sur', 'sus', 'suv', 'suz', 'swe', 'swh', 'sxb', 'sxn', 'sya', 'syl', 'sza', 'tac', 'taj', 'tam', 'tao', 'tap', 'taq', 'tat', 'tav', 'tbc', 'tbg', 'tbk', 'tbl', 'tby', 'tbz', 'tca', 'tcc', 'tcs', 'tcz', 'tdj', 'ted', 'tee', 'tel', 'tem', 'teo', 'ter', 'tes', 'tew', 'tex', 'tfr', 'tgj', 'tgk', 'tgl', 'tgo', 'tgp', 'tha', 'thk', 'thl', 'tih', 'tik', 'tir', 'tkr', 'tlb', 'tlj', 'tly', 'tmc', 'tmf', 'tna', 'tng', 'tnk', 'tnn', 'tnp', 'tnr', 'tnt', 'tob', 'toc', 'toh', 'tom', 'tos', 'tpi', 'tpm', 'tpp', 'tpt', 'trc', 'tri', 'trn', 'trs', 'tso', 'tsz', 'ttc', 'tte', 'ttq-script_tifinagh', 'tue', 'tuf', 'tuk-script_arabic', 'tuk-script_latin', 'tuo', 'tur', 'tvw', 'twb', 'twe', 'twu', 'txa', 'txq', 'txu', 'tye', 'tzh-dialect_bachajon', 'tzh-dialect_tenejapa', 'tzj-dialect_eastern', 'tzj-dialect_western', 'tzo-dialect_chamula', 'tzo-dialect_chenalho', 'ubl', 'ubu', 'udm', 'udu', 'uig-script_arabic', 'uig-script_cyrillic', 'ukr', 'umb', 'unr', 'upv', 'ura', 'urb', 'urd-script_arabic', 'urd-script_devanagari', 'urd-script_latin', 'urk', 'urt', 'ury', 'usp', 'uzb-script_cyrillic', 'uzb-script_latin', 'vag', 'vid', 'vie', 'vif', 'vmw', 'vmy', 'vot', 'vun', 'vut', 'wal-script_ethiopic', 'wal-script_latin', 'wap', 'war', 'waw', 'way', 'wba', 'wlo', 'wlx', 'wmw', 'wob', 'wol', 'wsg', 'wwa', 'xal', 'xdy', 'xed', 'xer', 'xho', 
'xmm', 'xnj', 'xnr', 'xog', 'xon', 'xrb', 'xsb', 'xsm', 'xsr', 'xsu', 'xta', 'xtd', 'xte', 'xtm', 'xtn', 'xua', 'xuo', 'yaa', 'yad', 'yal', 'yam', 'yao', 'yas', 'yat', 'yaz', 'yba', 'ybb', 'ycl', 'ycn', 'yea', 'yka', 'yli', 'yor', 'yre', 'yua', 'yue-script_traditional', 'yuz', 'yva', 'zaa', 'zab', 'zac', 'zad', 'zae', 'zai', 'zam', 'zao', 'zaq', 'zar', 'zas', 'zav', 'zaw', 'zca', 'zga', 'zim', 'ziw', 'zlm', 'zmz', 'zne', 'zos', 'zpc', 'zpg', 'zpi', 'zpl', 'zpm', 'zpo', 'zpt', 'zpu', 'zpz', 'ztq', 'zty', 'zul', 'zyb', 'zyp', 'zza'])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asr_processor.tokenizer.vocab.keys()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "541c53d1c740d668", "metadata": { "collapsed": false }, "source": [ "Switch out the language adapters by calling the `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. Pass the target language as the argument: here, the `language_id` value detected in the previous step."
] }, { "cell_type": "code", "execution_count": 18, "id": "15f9f4e31170b3fa", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:25.029800800Z", "start_time": "2023-10-12T15:55:24.860305100Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n" ] } ], "source": [ "asr_processor.tokenizer.set_target_lang(language_id)\n", "asr_model.load_adapter(language_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "de68b1eac717cc26", "metadata": { "collapsed": false }, "source": [ "### Use the original model for inference\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "4463e26404e16195", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:26.524665500Z", "start_time": "2023-10-12T15:55:25.032584500Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle\n" ] } ], "source": [ "inputs = asr_processor(example[\"audio\"][\"array\"], sampling_rate=16_000, return_tensors=\"pt\")\n", "\n", "with torch.no_grad():\n", " outputs = asr_model(**inputs).logits\n", "\n", "ids = torch.argmax(outputs, dim=-1)[0]\n", "transcription = asr_processor.decode(ids)\n", "print(transcription)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bda2f58170bfa2f4", "metadata": { "collapsed": false }, "source": [ "### Convert to OpenVINO IR model and run inference\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Convert to OpenVINO IR model format with `ov.convert_model` function directly. Use `ov.save_model` function to serialize the result of conversion. 
For convenience, we wrap both steps in a function." ] }, { "cell_type": "code", "execution_count": 20, "id": "f47ccb726cdb505d", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:35.147762500Z", "start_time": "2023-10-12T15:55:26.524665500Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n" ] } ], "source": [ "asr_model_xml_path_template = \"models/ov_asr_{}_model.xml\"\n", "\n", "\n", "def get_asr_model(model_path_template, language_id, compiled=True):\n", "    input_values = torch.zeros([1, MAX_SEQ_LENGTH], dtype=torch.float)\n", "    model_path = Path(model_path_template.format(language_id))\n", "\n", "    asr_processor.tokenizer.set_target_lang(language_id)\n", "    if not model_path.exists() and model_path_template == asr_model_xml_path_template:\n", "        asr_model.load_adapter(language_id)\n", "\n", "        model_path.parent.mkdir(parents=True, exist_ok=True)\n", "        converted_model = ov.convert_model(asr_model, example_input={\"input_values\": input_values})\n", "        ov.save_model(converted_model, model_path)\n", "        if not compiled:\n", "            return converted_model\n", "\n", "    if compiled:\n", "        return core.compile_model(model_path, device_name=device.value)\n", "    return core.read_model(model_path)\n", "\n", "\n", "compiled_asr_model = get_asr_model(asr_model_xml_path_template, language_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e4fb2cd466365800", "metadata": { "collapsed": false }, "source": [ "Run inference."
] }, { "cell_type": "code", "execution_count": 21, "id": "b83689739f10f2f4", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:36.147870900Z", "start_time": "2023-10-12T15:55:35.163604800Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original text: grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle\n", "Transcription: grisé par ce parfum il fit des vers en l'honneur de l'humble fleur des bois et il les récita tout haut à ses pieds une violette l'entendit elle crut qu'il ne parlait que pour elle\n" ] } ], "source": [ "def recognize_audio(compiled_model, src_audio):\n", "    inputs = asr_processor(src_audio, sampling_rate=16_000, return_tensors=\"pt\")\n", "    outputs = compiled_model(inputs[\"input_values\"])[0]\n", "\n", "    ids = torch.argmax(torch.from_numpy(outputs), dim=-1)[0]\n", "    transcription = asr_processor.decode(ids)\n", "\n", "    return transcription\n", "\n", "\n", "transcription = recognize_audio(compiled_asr_model, example[\"audio\"][\"array\"])\n", "print(\"Original text:\", example[\"text\"])\n", "print(\"Transcription:\", transcription)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6c57dd01", "metadata": { "collapsed": false }, "source": [ "## Quantization\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers to the model graph and then using a subset of the training dataset to initialize the parameters of these additional layers. Quantized operations are executed in `INT8` instead of `FP32`/`FP16`, making model inference faster.\n", "\n", "The optimization process contains the following steps:\n", "\n", "1. Create a calibration dataset for quantization.\n", "2. Run `nncf.quantize()` to obtain quantized models.\n", "3. 
Serialize the quantized `INT8` model using the `openvino.save_model()` function.\n", "\n", "> Note: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time." ] }, { "cell_type": "code", "execution_count": 22, "id": "5ef5a674", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:36.156165100Z", "start_time": "2023-10-12T15:55:36.148877700Z" }, "collapsed": false, "test_replace": { "value=False": "value=True" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "09e587014fd84f539588b73bfaf5338e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Checkbox(value=True, description='Quantization')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compiled_quantized_lid_model = None\n", "quantized_asr_model_xml_path_template = None\n", "\n", "to_quantize = widgets.Checkbox(\n", "    value=False,\n", "    description=\"Quantization\",\n", "    disabled=False,\n", ")\n", "\n", "to_quantize" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9bc6116f", "metadata": { "collapsed": false }, "source": [ "Let's load the `skip` magic extension to skip quantization if `to_quantize` is not selected." ] }, { "cell_type": "code", "execution_count": 23, "id": "153d38f9", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:36.170645600Z", "start_time": "2023-10-12T15:55:36.163169700Z" }, "collapsed": false }, "outputs": [], "source": [ "# Fetch `skip_kernel_extension` module\n", "import requests\n", "\n", "r = requests.get(\n", "    url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py\",\n", ")\n", "open(\"skip_kernel_extension.py\", \"w\").write(r.text)\n", "\n", "%load_ext skip_kernel_extension" ] }, { "attachments": {}, "cell_type": "markdown", "id": "465e90d5-0397-4095-bc15-5ecde37befd1", "metadata": {}, "source": [ "### Preparing calibration dataset\n", "[back to top 
⬆️](#Table-of-contents:)\n", "\n", "Select the language to quantize the model for:" ] }, { "cell_type": "code", "execution_count": 24, "id": "c83a711e", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:36.170645600Z", "start_time": "2023-10-12T15:55:36.164170900Z" }, "collapsed": false }, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "from IPython.display import display\n", "\n", "display(SAMPLE_LANG)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "92a36cd1-62a1-4a6e-85c3-f7b91008c1c3", "metadata": {}, "source": [ "Load validation split of the same [MLS](https://huggingface.co/datasets/multilingual_librispeech) dataset for the selected language." ] }, { "cell_type": "code", "execution_count": 25, "id": "f115ea90", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:40.725145300Z", "start_time": "2023-10-12T15:55:36.211259800Z" }, "collapsed": false }, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "mls_dataset = iter(load_dataset(\"facebook/multilingual_librispeech\", SAMPLE_LANG.value, split=\"validation\", streaming=True))\n", "example = next(mls_dataset)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fb6be56e-6cea-49ef-8c8a-91359ee506a5", "metadata": {}, "source": [ "Create calibration dataset for quantization." 
] }, { "cell_type": "code", "execution_count": 26, "id": "9a24e325-330b-4262-a1ad-86840e7d7ad7", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:40.806439900Z", "start_time": "2023-10-12T15:55:40.730350600Z" }, "test_replace": { "CALIBRATION_DATASET_SIZE = 5": "CALIBRATION_DATASET_SIZE = 1" } }, "outputs": [], "source": [ "%%skip not $to_quantize.value\n", "\n", "CALIBRATION_DATASET_SIZE = 5\n", "\n", "calibration_data = []\n", "for i in range(CALIBRATION_DATASET_SIZE):\n", "    data = asr_processor(next(mls_dataset)['audio']['array'], sampling_rate=16_000, return_tensors=\"np\")\n", "    calibration_data.append(data[\"input_values\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5f659976", "metadata": { "collapsed": false }, "source": [ "### Language identification model quantization\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Run LID model quantization." ] }, { "cell_type": "code", "execution_count": 27, "id": "7190096e", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:51.500695900Z", "start_time": "2023-10-12T15:55:40.807758Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:NNCF initialized successfully. 
Supported frameworks detected: torch, openvino\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00, 1.24s/it]\n", "Applying Smooth Quant: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 291/291 [00:18<00:00, 15.34it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:144 ignored nodes was found by name in the NNCFGraph\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:18<00:00, 3.65s/it]\n", "Applying Fast Bias correction: 100%|██████████████████████████████████████████████████████████████████████████████████████| 298/298 [05:09<00:00, 1.04s/it]\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "import nncf\n", "\n", "quantized_lid_model_xml_path = Path(str(lid_model_xml_path).replace(\".xml\", \"_quantized.xml\"))\n", "\n", "if not quantized_lid_model_xml_path.exists():\n", "    quantized_lid_model = nncf.quantize(\n", "        get_lid_model(lid_model_xml_path, compiled=False),\n", "        calibration_dataset=nncf.Dataset(calibration_data),\n", "        subset_size=len(calibration_data),\n", "        model_type=nncf.ModelType.TRANSFORMER\n", "    )\n", "    ov.save_model(quantized_lid_model, quantized_lid_model_xml_path)\n", "    compiled_quantized_lid_model = core.compile_model(quantized_lid_model, device_name=device.value)\n", "else:\n", "    compiled_quantized_lid_model = get_lid_model(quantized_lid_model_xml_path)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "dc5a9048-21b1-4939-9aa3-b48bb6e6c700", "metadata": {}, "source": [ "Detect language with the quantized model." 
] }, { "cell_type": "code", "execution_count": 28, "id": "9fdd3c69", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:55:52.495642700Z", "start_time": "2023-10-12T15:55:51.504925300Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Detected language: fra\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "language_id = detect_language(compiled_quantized_lid_model, example['audio']['array'])\n", "print(\"Detected language:\", language_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8ea8dc6b", "metadata": { "collapsed": false }, "source": [ "### Speech recognition model quantization\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Run ASR model quantization." ] }, { "cell_type": "code", "execution_count": 29, "id": "b26a8094", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:56:01.626285500Z", "start_time": "2023-10-12T15:55:52.491524500Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00, 1.17s/it]\n", "Applying Smooth Quant: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 290/290 [00:17<00:00, 16.39it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:nncf:144 ignored nodes was found by name in the NNCFGraph\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Statistics collection: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:19<00:00, 3.93s/it]\n", "Applying Fast Bias correction: 
100%|██████████████████████████████████████████████████████████████████████████████████████| 393/393 [05:22<00:00, 1.22it/s]\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "quantized_asr_model_xml_path_template = asr_model_xml_path_template.replace(\".xml\", \"_quantized.xml\")\n", "quantized_asr_model_xml_path = Path(quantized_asr_model_xml_path_template.format(language_id))\n", "\n", "if not quantized_asr_model_xml_path.exists():\n", "    quantized_asr_model = nncf.quantize(\n", "        get_asr_model(asr_model_xml_path_template, language_id, compiled=False),\n", "        calibration_dataset=nncf.Dataset(calibration_data),\n", "        subset_size=len(calibration_data),\n", "        model_type=nncf.ModelType.TRANSFORMER\n", "    )\n", "    ov.save_model(quantized_asr_model, quantized_asr_model_xml_path)\n", "    compiled_quantized_asr_model = core.compile_model(quantized_asr_model, device_name=device.value)\n", "else:\n", "    compiled_quantized_asr_model = get_asr_model(quantized_asr_model_xml_path_template, language_id)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4776f2d6-2d34-4e4f-a999-b75647ca7c32", "metadata": {}, "source": [ "Run transcription with quantized model and compare the result to the one produced by original model." 
] }, { "cell_type": "code", "execution_count": 30, "id": "fd8ec20e", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:56:12.738307100Z", "start_time": "2023-10-12T15:56:01.643402100Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Transcription by original model: le salon était de la plus haute magnificence dorée comme la galerie de diane aux tuileries avec des tableaux à l'huile au lombri il y avait des tâches claires dans ces tableaux julien apprit plus tard que les sujets avaient semblé peu décent à la maîtresse du logis qui avait fait corriger les tableaux\n", "Transcription by quantized model: le salon était de la plus haute magnificence doré comme la galerie de diane aux tuileries avec des tableaux à l'huile au lombri il y avait des tâches claires dans ces tableaux julien apprit plus tard que les sujets avaient semblé peu decent à la maîtresse du logis qui avait fait corriger les tableaux\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "compiled_asr_model = get_asr_model(asr_model_xml_path_template, language_id)\n", "transcription_original = recognize_audio(compiled_asr_model, example['audio']['array'])\n", "transcription_quantized = recognize_audio(compiled_quantized_asr_model, example['audio']['array'])\n", "print(\"Transcription by original model: \", transcription_original)\n", "print(\"Transcription by quantized model:\", transcription_quantized)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3d7702bf", "metadata": { "collapsed": false }, "source": [ "### Compare model size, performance and accuracy\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "First we compare model size." 
] }, { "cell_type": "code", "execution_count": 31, "id": "05eebdb7", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:56:12.738307100Z", "start_time": "2023-10-12T15:56:12.738307100Z" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LID model footprint comparison:\n", " * FP32 IR model size: 1931.81 MB\n", " * INT8 IR model size: 968.96 MB\n", "ASR model footprint comparison:\n", " * FP32 IR model size: 1930.10 MB\n", " * INT8 IR model size: 968.29 MB\n" ] } ], "source": [ "%%skip not $to_quantize.value\n", "\n", "def calculate_compression_rate(model_path_ov, model_path_ov_int8, model_type):\n", "    model_size_fp32 = model_path_ov.with_suffix(\".bin\").stat().st_size / 10 ** 6\n", "    model_size_int8 = model_path_ov_int8.with_suffix(\".bin\").stat().st_size / 10 ** 6\n", "    print(f\"{model_type} model footprint comparison:\")\n", "    print(f\" * FP32 IR model size: {model_size_fp32:.2f} MB\")\n", "    print(f\" * INT8 IR model size: {model_size_int8:.2f} MB\")\n", "    return model_size_fp32, model_size_int8\n", "\n", "lid_model_size_fp32, lid_model_size_int8 = \\\n", "    calculate_compression_rate(lid_model_xml_path, quantized_lid_model_xml_path, 'LID')\n", "asr_model_size_fp32, asr_model_size_int8 = \\\n", "    calculate_compression_rate(Path(asr_model_xml_path_template.format(language_id)), quantized_asr_model_xml_path, 'ASR')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "35db21f7", "metadata": { "collapsed": false }, "source": [ "Second, we compare the accuracy of the original and quantized models on the test split of the MLS dataset. We rely on the Word Error Rate (WER) metric and compute accuracy as `(1 - WER)`.\n", "\n", "We also measure inference time for both the language identification and speech recognition models." 
] }, { "cell_type": "code", "execution_count": 32, "id": "4efd53cf", "metadata": { "ExecuteTime": { "end_time": "2023-10-12T15:56:20.060811300Z", "start_time": "2023-10-12T15:56:12.740287600Z" }, "collapsed": false, "test_replace": { "TEST_DATASET_SIZE = 20": "TEST_DATASET_SIZE = 1" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2d64d44078ca4356b1f3daa64c2689a1", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Measuring performance and accuracy: 0%| | 0/20 [00:00" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "Ignored unknown kwarg option normalize\n", "WARNING:nncf:NNCF provides best results with torch==2.0.1, while current torch version is 1.13.1+cu117. If you encounter issues, consider switching to torch==2.0.1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.7'\n", "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:595: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", " if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n", "/home/nsavel/venvs/ov_notebooks_tmp/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py:634: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", " if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n" ] } ], "source": [ "import gradio as gr\n", "import librosa\n", "import time\n", "\n", "\n", "title = \"MMS with Gradio\"\n", "description = (\n", "    'Gradio Demo for MMS and OpenVINO™. Upload a source audio, then click the \"Run\" button to detect the language ID and transcribe the audio. '\n", "    \"The audio is resampled to 16 kHz on load. If this language has not been used before, it may take some time to prepare the ASR model.\"\n", "    \"\\n\"\n", "    \"> Note: To transcribe a language with the quantized model, the quantized model for that specific language must be prepared first.\"\n", ")\n", "\n", "\n", "current_state = {\n", "    \"fp32\": {\"model\": None, \"language\": None},\n", "    \"int8\": {\"model\": None, \"language\": None},\n", "}\n", "\n", "\n", "def infer(src_audio_path, quantized):\n", "    # The MMS processors expect 16 kHz audio; librosa.load defaults to 22050 Hz\n", "    src_audio, _ = librosa.load(src_audio_path, sr=16_000)\n", "    lid_model = compiled_quantized_lid_model if quantized else compiled_lid_model\n", "\n", "    start_time = time.perf_counter()\n", "    detected_language_id = detect_language(lid_model, src_audio)\n", "    end_time = time.perf_counter()\n", "    identification_delta_time = f\"{end_time - start_time:.2f}\"\n", "\n", "    state = current_state[\"int8\" if quantized else \"fp32\"]\n", "    if detected_language_id != state[\"language\"]:\n", "        template_path = quantized_asr_model_xml_path_template if quantized else 
asr_model_xml_path_template\n", "        try:\n", "            gr.Info(f\"Loading {'quantized' if quantized else ''} ASR model for '{detected_language_id}' language. \" \"This will take some time.\")\n", "            state[\"model\"] = get_asr_model(template_path, detected_language_id)\n", "            state[\"language\"] = detected_language_id\n", "        except RuntimeError as e:\n", "            if \"Unable to read the model:\" in str(e) and quantized:\n", "                raise gr.Error(f\"There is no quantized ASR model for '{detected_language_id}' language. \" \"Please run quantization for this language first.\")\n", "            # Do not silently swallow unrelated errors\n", "            raise\n", "\n", "    start_time = time.perf_counter()\n", "    transcription = recognize_audio(state[\"model\"], src_audio)\n", "    end_time = time.perf_counter()\n", "    transcription_delta_time = f\"{end_time - start_time:.2f}\"\n", "\n", "    return (\n", "        detected_language_id,\n", "        transcription,\n", "        identification_delta_time,\n", "        transcription_delta_time,\n", "    )\n", "\n", "\n", "with gr.Blocks() as demo:\n", "    with gr.Row():\n", "        gr.Markdown(f\"# {title}\")\n", "    with gr.Row():\n", "        gr.Markdown(description)\n", "\n", "    run_button = {True: None, False: None}\n", "    detected_language = {True: None, False: None}\n", "    transcription = {True: None, False: None}\n", "    identification_time = {True: None, False: None}\n", "    transcription_time = {True: None, False: None}\n", "    for quantized in [False, True]:\n", "        if quantized and not to_quantize.value:\n", "            break\n", "        with gr.Row():\n", "            with gr.Column():\n", "                if not quantized:\n", "                    audio = gr.Audio(label=\"Source Audio\", type=\"filepath\")\n", "                run_button_name = \"Run INT8\" if quantized else \"Run FP32\" if to_quantize.value else \"Run\"\n", "                run_button[quantized] = gr.Button(value=run_button_name)\n", "            with gr.Column():\n", "                detected_language[quantized] = gr.Textbox(label=f\"Detected language ID{' (Quantized)' if quantized else ''}\")\n", "                transcription[quantized] = gr.Textbox(label=f\"Transcription{' (Quantized)' if quantized else ''}\")\n", "                
identification_time[quantized] = gr.Textbox(label=f\"Identification time{' (Quantized)' if quantized else ''}\")\n", "                transcription_time[quantized] = gr.Textbox(label=f\"Transcription time{' (Quantized)' if quantized else ''}\")\n", "\n", "    run_button[False].click(\n", "        infer,\n", "        inputs=[audio, gr.Number(0, visible=False)],\n", "        outputs=[\n", "            detected_language[False],\n", "            transcription[False],\n", "            identification_time[False],\n", "            transcription_time[False],\n", "        ],\n", "    )\n", "    if to_quantize.value:\n", "        run_button[True].click(\n", "            infer,\n", "            inputs=[audio, gr.Number(1, visible=False)],\n", "            outputs=[\n", "                detected_language[True],\n", "                transcription[True],\n", "                identification_time[True],\n", "                transcription_time[True],\n", "            ],\n", "        )\n", "\n", "\n", "try:\n", "    demo.queue().launch(debug=True)\n", "except Exception:\n", "    demo.queue().launch(share=True, debug=True)\n", "# if you are launching remotely, specify server_name and server_port\n", "# demo.launch(server_name='your server name', server_port='server port in int')\n", "# Read more in the docs: https://gradio.app/docs/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "openvino_notebooks": { "imageUrl": "", "tags": { "categories": [ "Model Demos" ], "libraries": [], "other": [], "tasks": [ "Audio Classification", "Speech Recognition" ] } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }