{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "9084467f-2e18-463e-8099-4920525964f6",
"metadata": {},
"source": [
"# Speaker diarization\n",
"\n",
"Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question \"who spoke when?\"\n",
"\n",
"\n",
"\n",
"With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarization has received much attention by the speech community. Speaker diarization is an essential feature for a speech recognition system to enrich the transcription with speaker labels.\n",
"\n",
"Speaker diarization is used to increase transcript readability and better understand what a conversation is about. Speaker diarization can help extract important points or action items from the conversation and identify who said what. It also helps to identify how many speakers were on the audio.\n",
"\n",
"This tutorial considers ways to build speaker diarization pipeline using pyannote.audio and OpenVINO. `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on PyTorch deep learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. You can find more information about pyannote pre-trained models in [model card](https://huggingface.co/pyannote/speaker-diarization), [repo](https://github.com/pyannote/pyannote-audio) and [paper](https://arxiv.org/abs/1911.01255).\n",
"\n",
"\n",
"#### Table of contents:\n",
"\n",
"- [Prerequisites](#Prerequisites)\n",
"- [Prepare pipeline](#Prepare-pipeline)\n",
"- [Load test audio file](#Load-test-audio-file)\n",
"- [Run inference pipeline](#Run-inference-pipeline)\n",
"- [Convert model to OpenVINO Intermediate Representation format](#Convert-model-to-OpenVINO-Intermediate-Representation-format)\n",
"- [Select inference device](#Select-inference-device)\n",
"- [Replace segmentation model with OpenVINO](#Replace-segmentation-model-with-OpenVINO)\n",
"- [Run speaker diarization with OpenVINO](#Run-speaker-diarization-with-OpenVINO)\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eea16121-54ce-40fe-a2ac-52fdc6909256",
"metadata": {},
"source": [
"## Prerequisites\n",
"[back to top ⬆️](#Table-of-contents:)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8ddb010-642a-478c-96a9-f72cf365328e",
"metadata": {
"tags": [
"hide-output"
]
},
"outputs": [],
"source": [
"%pip install -q --extra-index-url https://download.pytorch.org/whl/cpu \"librosa>=0.8.1\" \"matplotlib<3.8\" \"ruamel.yaml>=0.17.8,<0.17.29\" \"torch>=2.1\" tqdm torchvision torchaudio \"git+https://github.com/eaidova/pyannote-audio.git@hub0.10\" \"openvino>=2023.1.0\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b3ca0e47-f42c-453a-8843-79fb916b4519",
"metadata": {},
"source": [
"## Prepare pipeline\n",
"[back to top ⬆️](#Table-of-contents:)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c2b836c9-89b0-46aa-b8f8-8ec94afd42dc",
"metadata": {},
"source": [
"Traditional Speaker Diarization systems can be generalized into a five-step process:\n",
"\n",
" * **Feature extraction**: transform the raw waveform into audio features like mel spectrogram.\n",
" * **Voice activity detection**: identify the chunks in the audio where some voice activity was observed. As we are not interested in silence and noise, we ignore those irrelevant chunks.\n",
" * **Speaker change detection**: identify the speaker change points in the conversation present in the audio.\n",
" * **Speech turn representation**: encode each subchunk by creating feature representations.\n",
" * **Speech turn clustering**: cluster the subchunks based on their vector representation. Different clustering algorithms may be applied based on the availability of cluster count (k) and the embedding process of the previous step.\n",
"\n",
"The final output will be the clusters of different subchunks from the audio stream. Each cluster can be given an anonymous identifier (speaker_a, ..) and then it can be mapped with the audio stream to create the speaker-aware audio timeline.\n",
"\n",
"On the diagram, you can see a typical speaker diarization pipeline:\n",
"\n",
"\n",
"\n",
"From a simplified point of view, speaker diarization is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments based on speaker characteristics."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6fd236ec-c768-45f9-b3d7-f4f795c52cdd",
"metadata": {},
"source": [
"For instantiating speaker diarization pipeline with `pyannote.audio` library, we should import `Pipeline` class and use `from_pretrained` method by providing a path to the directory with pipeline configuration or identification from [HuggingFace hub](https://huggingface.co/pyannote/speaker-diarization).\n",
"\n",
">**Note**:\n",
 This tutorial uses">
"> This tutorial uses an unofficial version of the model, `philschmid/pyannote-speaker-diarization-endpoint`, provided only for demo purposes.\n",
"> The original model (`pyannote/speaker-diarization`) requires you to accept the model license before downloading or using its weights. Visit the [pyannote/speaker-diarization](https://huggingface.co/pyannote/speaker-diarization) model card to read and accept the license before you proceed.\n",
">To use this model, you must be a registered user of the 🤗 Hugging Face Hub. You will need an access token for the code below to run. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).\n",
">You can log in to the Hugging Face Hub in the notebook environment using the following code:\n",
"```python\n",
"\n",
"## login to huggingfacehub to get access to pre-trained model\n",
"\n",
"from huggingface_hub import notebook_login, whoami\n",
"\n",
"try:\n",
" whoami()\n",
" print('Authorization token already provided')\n",
"except OSError:\n",
" notebook_login()\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0e0d4ff9-bee2-4359-a32e-0ae17b245380",
"metadata": {},
"outputs": [
],
"source": [
"from pyannote.audio import Pipeline\n",
"\n",
"pipeline = Pipeline.from_pretrained(\"philschmid/pyannote-speaker-diarization-endpoint\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "95a8ada2-1df8-446e-a41d-ec076e9af3e8",
"metadata": {},
"source": [
"## Load test audio file\n",
"[back to top ⬆️](#Table-of-contents:)\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "b87ae84b-ba08-47ff-a1a6-1544a36eab5e",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1f54c11967a942dfbc8bf22f15a1a3cf",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"sample.wav: 0%| | 0.00/938k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Fetch `notebook_utils` module\n",
"import requests\n",
"\n",
"r = requests.get(\n",
" url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py\",\n",
")\n",
"\n",
"open(\"notebook_utils.py\", \"w\").write(r.text)\n",
"\n",
"from notebook_utils import download_file\n",
"\n",
"test_data_url = \"https://github.com/pyannote/pyannote-audio/raw/develop/tutorials/assets/sample.wav\"\n",
"\n",
"sample_file = \"sample.wav\"\n",
"download_file(test_data_url, \"sample.wav\")\n",
"AUDIO_FILE = {\"uri\": sample_file.replace(\".wav\", \"\"), \"audio\": sample_file}"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d154b3eb-76b0-448a-82e5-a6c6c8318b98",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import librosa\n",
"import matplotlib.pyplot as plt\n",
"import librosa.display\n",
"import IPython.display as ipd\n",
"\n",
"\n",
"audio, sr = librosa.load(sample_file)\n",
"plt.figure(figsize=(14, 5))\n",
"librosa.display.waveshow(audio, sr=sr)\n",
"\n",
"ipd.Audio(sample_file)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "ba3bdf01-bc21-4b2e-9af7-89bf7d9315a6",
"metadata": {},
"source": [
"## Run inference pipeline\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"For running inference, we should provide a path to input audio to the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "15694c5d-3f7e-49a5-a0ee-066fa04d5bf8",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"import time\n",
"\n",
"start = time.perf_counter()\n",
"diarization = pipeline(AUDIO_FILE)\n",
"end = time.perf_counter()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "79c5c09c-9ff5-424c-9258-43d871e1c198",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Diarization pipeline took 14.14 s\n"
]
}
],
"source": [
"print(f\"Diarization pipeline took {end - start:.2f} s\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "386b2cdd-0684-4094-bbbf-7402dafa0616",
"metadata": {},
"source": [
"The result of running the pipeline can be represented as a diagram indicating when each person speaks."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "59c3af62-b48e-4887-9556-56f50034815a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diarization"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1087f65a-30f7-427b-bc40-d2787c23dd4c",
"metadata": {},
"source": [
"We can also print each time frame and corresponding speaker:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0c1f2eab-b001-45dd-8969-ad03e9676144",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start=6.7s stop=7.1s speaker_SPEAKER_00\n",
"start=7.6s stop=8.6s speaker_SPEAKER_00\n",
"start=8.6s stop=10.0s speaker_SPEAKER_02\n",
"start=9.8s stop=11.0s speaker_SPEAKER_00\n",
"start=10.6s stop=14.7s speaker_SPEAKER_02\n",
"start=14.3s stop=17.9s speaker_SPEAKER_01\n",
"start=17.9s stop=21.5s speaker_SPEAKER_02\n",
"start=18.3s stop=18.4s speaker_SPEAKER_01\n",
"start=21.7s stop=28.6s speaker_SPEAKER_01\n",
"start=27.8s stop=29.5s speaker_SPEAKER_02\n"
]
}
],
"source": [
"for turn, _, speaker in diarization.itertracks(yield_label=True):\n",
" print(f\"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}\")"
]
},
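{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-speaking-time-per-speaker",
"metadata": {},
"source": [
"Speaker turns like these are easy to post-process. As a minimal sketch, the snippet below sums the speaking time of each speaker from a list of `(start, stop, speaker)` tuples copied from the output above. The `turns` list and the `speaking_time` helper are illustrative names, not part of `pyannote.audio`; with a real pipeline result, the tuples could be collected from `diarization.itertracks(yield_label=True)` in the same way as in the previous cell.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"# (start, stop, speaker) tuples copied from the output printed above.\n",
"turns = [\n",
"    (6.7, 7.1, \"SPEAKER_00\"),\n",
"    (7.6, 8.6, \"SPEAKER_00\"),\n",
"    (8.6, 10.0, \"SPEAKER_02\"),\n",
"    (9.8, 11.0, \"SPEAKER_00\"),\n",
"    (10.6, 14.7, \"SPEAKER_02\"),\n",
"    (14.3, 17.9, \"SPEAKER_01\"),\n",
"    (17.9, 21.5, \"SPEAKER_02\"),\n",
"    (18.3, 18.4, \"SPEAKER_01\"),\n",
"    (21.7, 28.6, \"SPEAKER_01\"),\n",
"    (27.8, 29.5, \"SPEAKER_02\"),\n",
"]\n",
"\n",
"\n",
"def speaking_time(turns):\n",
"    # Sum the duration of every turn per speaker label.\n",
"    totals = defaultdict(float)\n",
"    for start, stop, speaker in turns:\n",
"        totals[speaker] += stop - start\n",
"    return dict(totals)\n",
"\n",
"\n",
"totals = speaking_time(turns)\n",
"for speaker, seconds in sorted(totals.items()):\n",
"    print(f\"{speaker}: {seconds:.1f} s\")\n",
"```\n",
"\n",
"Overlapping turns are counted once for each speaker involved, so the totals can add up to more than the duration of the recording."
]
},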
{
"attachments": {},
"cell_type": "markdown",
"id": "a253dfe2-6b05-4bcc-a18e-8f7f7a7c7fbe",
"metadata": {},
"source": [
"## Convert model to OpenVINO Intermediate Representation format\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"For best results with OpenVINO, it is recommended to convert the model to OpenVINO IR format. OpenVINO supports PyTorch via ONNX conversion. We will use `torch.onnx.export` for exporting the ONNX model from PyTorch. We need to provide initialized model's instance and example of inputs for shape inference. We will use `ov.convert_model` functionality to convert the ONNX models. The `mo.convert_model` Python function returns an OpenVINO model ready to load on the device and start making predictions. We can save it on disk for the next usage with `ov.save_model`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "fcdae729-8a81-4c96-8c0a-67685b6d89b4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model successfully converted to IR and saved to pyannote-segmentation.xml\n"
]
}
],
"source": [
"from pathlib import Path\n",
"import torch\n",
"import openvino as ov\n",
"\n",
"core = ov.Core()\n",
"\n",
"ov_speaker_segmentation_path = Path(\"pyannote-segmentation.xml\")\n",
"\n",
"if not ov_speaker_segmentation_path.exists():\n",
" onnx_path = ov_speaker_segmentation_path.with_suffix(\".onnx\")\n",
" torch.onnx.export(\n",
" pipeline._segmentation.model,\n",
" torch.zeros((1, 1, 80000)),\n",
" onnx_path,\n",
" input_names=[\"chunks\"],\n",
" output_names=[\"outputs\"],\n",
" dynamic_axes={\"chunks\": {0: \"batch_size\", 2: \"wave_len\"}},\n",
" )\n",
" ov_speaker_segmentation = ov.convert_model(onnx_path)\n",
" ov.save_model(ov_speaker_segmentation, str(ov_speaker_segmentation_path))\n",
" print(f\"Model successfully converted to IR and saved to {ov_speaker_segmentation_path}\")\n",
"else:\n",
" ov_speaker_segmentation = core.read_model(ov_speaker_segmentation_path)\n",
" print(f\"Model successfully loaded from {ov_speaker_segmentation_path}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "071cbdcc-0bd4-42b0-bf73-5127f2e63b25",
"metadata": {},
"source": [
"## Select inference device\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"select device from dropdown list for running inference using OpenVINO"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2783a6f6-a447-4abb-b2c3-b0d6918679b2",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "21d28064d0b94b558ee5c64a640df9ea",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Dropdown(description='Device:', index=3, options=('CPU', 'GPU.0', 'GPU.1', 'AUTO'), value='AUTO')"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ipywidgets as widgets\n",
"\n",
"device = widgets.Dropdown(\n",
" options=core.available_devices + [\"AUTO\"],\n",
" value=\"AUTO\",\n",
" description=\"Device:\",\n",
" disabled=False,\n",
")\n",
"\n",
"device"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "42ff13a0-f3f1-49ea-8314-442f55292912",
"metadata": {},
"source": [
"## Replace segmentation model with OpenVINO\n",
"[back to top ⬆️](#Table-of-contents:)\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "02b53cab-e623-4e38-8974-48d1ab352228",
"metadata": {},
"outputs": [],
"source": [
"core = ov.Core()\n",
"\n",
"ov_seg_model = core.compile_model(ov_speaker_segmentation, device.value)\n",
"infer_request = ov_seg_model.create_infer_request()\n",
"ov_seg_out = ov_seg_model.output(0)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "828bf60e-2749-44da-92dc-dec0bed10c50",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def infer_segm(chunks: torch.Tensor) -> np.ndarray:\n",
" \"\"\"\n",
" Inference speaker segmentation mode using OpenVINO\n",
" Parameters:\n",
" chunks (torch.Tensor) input audio chunks\n",
" Return:\n",
" segments (np.ndarray)\n",
" \"\"\"\n",
" res = ov_seg_model(chunks)\n",
" return res[ov_seg_out]\n",
"\n",
"\n",
"pipeline._segmentation.infer = infer_segm"
]
},
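{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-replace-infer-method",
"metadata": {},
"source": [
"The cell above swaps the `infer` method of the segmentation wrapper for a function backed by the compiled OpenVINO model. The same pattern in isolation might look like the sketch below. The `Segmentation` class and the `fast_infer` function are hypothetical stand-ins used only to show the mechanism; they are not the real `pyannote.audio` objects.\n",
"\n",
"```python\n",
"class Segmentation:\n",
"    # Hypothetical stand-in for an object that exposes an `infer` method.\n",
"    def infer(self, chunks):\n",
"        return [f\"original:{c}\" for c in chunks]\n",
"\n",
"    def __call__(self, chunks):\n",
"        # `self.infer` is looked up at call time, so a replacement\n",
"        # assigned later is picked up automatically.\n",
"        return self.infer(chunks)\n",
"\n",
"\n",
"def fast_infer(chunks):\n",
"    # Replacement backend; note that it takes no `self` argument.\n",
"    return [f\"replaced:{c}\" for c in chunks]\n",
"\n",
"\n",
"segmentation = Segmentation()\n",
"before = segmentation([\"a\", \"b\"])\n",
"\n",
"# A plain function assigned to an instance attribute shadows the class method.\n",
"segmentation.infer = fast_infer\n",
"after = segmentation([\"a\", \"b\"])\n",
"\n",
"print(before, after)\n",
"```\n",
"\n",
"Because the function is stored on the instance rather than on the class, it is called without `self`, which is why `infer_segm` takes only `chunks`."
]
},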
{
"attachments": {},
"cell_type": "markdown",
"id": "c853d7c9-c435-4c62-b9d2-8c973c84ece0",
"metadata": {},
"source": [
"## Run speaker diarization with OpenVINO\n",
"[back to top ⬆️](#Table-of-contents:)\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bafcf1a7-e174-42f8-8993-8e6536fdea18",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Model is not converging. Current: 16444.281598619917 is not greater than 16444.330820454463. Delta is -0.04922183454618789\n",
"Model is not converging. Current: 16444.281598619917 is not greater than 16444.330820454463. Delta is -0.04922183454618789\n"
]
}
],
"source": [
"start = time.perf_counter()\n",
"diarization = pipeline(AUDIO_FILE)\n",
"end = time.perf_counter()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "da8bbc5d-789f-4c1c-a777-4eda3c4f4e29",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Diarization pipeline took 13.58 s\n"
]
}
],
"source": [
"print(f\"Diarization pipeline took {end - start:.2f} s\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8d579663-e109-49dd-953d-a1ee4b9a38fe",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"diarization"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6d011188-23b9-4420-a6ad-68d76e895f64",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start=6.7s stop=7.1s speaker_SPEAKER_00\n",
"start=7.6s stop=8.6s speaker_SPEAKER_00\n",
"start=8.6s stop=10.0s speaker_SPEAKER_02\n",
"start=9.8s stop=11.0s speaker_SPEAKER_00\n",
"start=10.6s stop=14.7s speaker_SPEAKER_02\n",
"start=14.3s stop=17.9s speaker_SPEAKER_01\n",
"start=17.9s stop=21.5s speaker_SPEAKER_02\n",
"start=18.3s stop=18.4s speaker_SPEAKER_01\n",
"start=21.7s stop=28.6s speaker_SPEAKER_01\n",
"start=27.8s stop=29.5s speaker_SPEAKER_02\n"
]
}
],
"source": [
"for turn, _, speaker in diarization.itertracks(yield_label=True):\n",
" print(f\"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}\")"
]
},
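{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-compare-speaker-turns",
"metadata": {},
"source": [
"One way to check that the two runs agree is to compare their speaker turns within a small time tolerance. The sketch below assumes each run has been collected into a list of `(start, stop, speaker)` tuples. The `same_turns` helper and the 0.05 s tolerance are illustrative choices, not part of `pyannote.audio`, and the check assumes both runs assign the same labels to the same speakers.\n",
"\n",
"```python\n",
"def same_turns(reference, candidate, tolerance=0.05):\n",
"    # True when both lists have the same speakers in the same order and\n",
"    # all boundaries differ by at most `tolerance` seconds.\n",
"    if len(reference) != len(candidate):\n",
"        return False\n",
"    for ref, cand in zip(reference, candidate):\n",
"        ref_start, ref_stop, ref_speaker = ref\n",
"        cand_start, cand_stop, cand_speaker = cand\n",
"        if ref_speaker != cand_speaker:\n",
"            return False\n",
"        if abs(ref_start - cand_start) > tolerance:\n",
"            return False\n",
"        if abs(ref_stop - cand_stop) > tolerance:\n",
"            return False\n",
"    return True\n",
"\n",
"\n",
"# First three turns from the two outputs above (values as printed).\n",
"pytorch_turns = [(6.7, 7.1, \"SPEAKER_00\"), (7.6, 8.6, \"SPEAKER_00\"), (8.6, 10.0, \"SPEAKER_02\")]\n",
"openvino_turns = [(6.7, 7.1, \"SPEAKER_00\"), (7.6, 8.6, \"SPEAKER_00\"), (8.6, 10.0, \"SPEAKER_02\")]\n",
"print(same_turns(pytorch_turns, openvino_turns))\n",
"```\n"
]
},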
{
"attachments": {},
"cell_type": "markdown",
"id": "2baa3f21-c031-4c3b-878d-0064de5480d4",
"metadata": {},
"source": [
"Nice! As we can see, the result preserves the same level of accuracy!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"openvino_notebooks": {
"imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/pyannote-speaker-diarization/pyannote-speaker-diarization.png?raw=true",
"tags": {
"categories": [
"Model Demos"
],
"libraries": [],
"other": [],
"tasks": [
"Voice Activity Detection"
]
}
},
"vscode": {
"interpreter": {
"hash": "cec18e25feb9469b5ff1085a8097bdcd86db6a4ac301d6aeff87d0f3e7ce4ca5"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}