Commit 1435c61 by philippemos · 1 Parent(s): 695f284

modify pyannote name

app.py CHANGED
@@ -97,7 +97,7 @@ def execute_diarization(file_uploader: st.uploaded_file_manager.UploadedFile, se
 
     if diarization_bool:
         pipeline_count += 1
-        if diarization_name == 'PyAnnote':
+        if diarization_name == 'pyannote':
             diarizer = pyannote_diarizer.PyannoteDiarizer(file_path)
         elif diarization_name == 'NeMo':
             diarizer = nemo_diarizer.NemoDiarizer(file_path, user_folder)
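Review note: the comparison in `execute_diarization` is case-sensitive, so it has to match the string stored in `DIARIZATION_METHODS` exactly. One way to keep the two from drifting apart is to dispatch through a mapping instead of an `if`/`elif` chain. The sketch below is only an illustration: `DIARIZER_FACTORIES` and `build_diarizer` are hypothetical names, and the two classes are stand-ins for the real `pyannote_diarizer.PyannoteDiarizer` and `nemo_diarizer.NemoDiarizer`, whose constructor arguments are taken from the hunk above.

```python
from typing import Callable, Dict

# Same values as configs.py after this commit.
DIARIZATION_METHODS = ['pyannote', 'NeMo']


class PyannoteDiarizer:
    """Stand-in for pyannote_diarizer.PyannoteDiarizer (assumption)."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path


class NemoDiarizer:
    """Stand-in for nemo_diarizer.NemoDiarizer (assumption)."""

    def __init__(self, file_path: str, user_folder: str) -> None:
        self.file_path = file_path
        self.user_folder = user_folder


# Hypothetical registry: one factory per configured method name.
DIARIZER_FACTORIES: Dict[str, Callable[[str, str], object]] = {
    'pyannote': lambda file_path, user_folder: PyannoteDiarizer(file_path),
    'NeMo': lambda file_path, user_folder: NemoDiarizer(file_path, user_folder),
}

# Fail at import time if the config and the registry disagree on names.
assert set(DIARIZER_FACTORIES) == set(DIARIZATION_METHODS)


def build_diarizer(name: str, file_path: str, user_folder: str) -> object:
    """Return a diarizer for `name`, or raise on an unknown or misspelled name."""
    try:
        factory = DIARIZER_FACTORIES[name]
    except KeyError:
        raise ValueError(f"Unknown diarization method: {name!r}") from None
    return factory(file_path, user_folder)


diarizer = build_diarizer('pyannote', 'samples/recording_1.wav', 'uploaded_samples')
print(type(diarizer).__name__)
```

With this shape, a stale spelling such as `'PyAnnote'` raises a `ValueError` instead of silently skipping the branch.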
computed_diarization_plots/{recording_1_PyAnnote.png → recording_1_pyannote.png} RENAMED
File without changes
computed_diarization_plots/{recording_2_PyAnnote.png → recording_2_pyannote.png} RENAMED
File without changes
computed_diarization_plots/{recording_3_PyAnnote.png → recording_3_pyannote.png} RENAMED
File without changes
configs.py CHANGED
@@ -1,5 +1,5 @@
 """General configs"""
-DIARIZATION_METHODS = ['PyAnnote', 'NeMo']
+DIARIZATION_METHODS = ['pyannote', 'NeMo']
 AUDIO_SAMPLES_DIR = 'samples'
 UPLOADED_AUDIO_SAMPLES_DIR = 'uploaded_samples'
 PRECOMPUTED_DIARIZATION_FIGURE = 'computed_diarization_plots'
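Review note: the plot files are renamed in the same commit as the config entry, which suggests the precomputed figure path is derived from the method name. A minimal sketch of that coupling, assuming a `<sample>_<method>.png` pattern inferred from the renamed files; `precomputed_plot_path` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

# Values as they appear in configs.py after this commit.
DIARIZATION_METHODS = ['pyannote', 'NeMo']
PRECOMPUTED_DIARIZATION_FIGURE = 'computed_diarization_plots'


def precomputed_plot_path(sample_name: str, method: str) -> Path:
    """Hypothetical helper: path of the precomputed plot for a sample and method.

    Assumes the '<sample>_<method>.png' naming seen in the renamed files.
    """
    if method not in DIARIZATION_METHODS:
        # A stale spelling such as 'PyAnnote' would otherwise point at a
        # file that no longer exists after the rename.
        raise ValueError(f"Unknown diarization method: {method!r}")
    return Path(PRECOMPUTED_DIARIZATION_FIGURE) / f"{sample_name}_{method}.png"


print(precomputed_plot_path('recording_1', 'pyannote').as_posix())
```

Because the lookup is an exact string match, the config entry, the comparison in `app.py`, and the file names all have to change together, which is what this commit does.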
utils/text_utils.py CHANGED
@@ -116,8 +116,7 @@ def intro_container():
         "By generating a **speaker-aware transcript**, we can more easily interpret the generated"
         " conversation compared to a generated transcript without diarization. Much neater no? ✨")
     container.caption(
-        "Alright, that's more interpretable for human readers, but what can I do with "
-        "those speaker-aware transcripts? 🤔")
+        "But what can I do with these speaker-aware transcripts? 🤔")
     container.markdown(
         "Speaker-aware transcripts can be a powerful tool for analyzing speech data:")
     container.markdown("""
@@ -127,12 +126,12 @@ def intro_container():
     """)
     container.markdown(
         "Now that we've seen the importance of speaker diarization and some of its applications,"
-        " it's time to find out we can implement diarization algorithms.")
+        " it's time to find out how we can implement diarization algorithms.")
 
     container.markdown("---")
     container.header('๐Ÿ“ The workflow of a speaker diarization system')
     container.markdown(
-        "Building robust and accurate speaker diarization is not as a trivial task."
+        "Building robust and accurate speaker diarization is not a trivial task."
         " Real world audio data is messy and complex due to many factors, such"
         " as having a noisy background, multiple speakers talking at the same time and "
         "subtle differences between the speakers' voices in pitch and tone. Moreover, speaker diarization systems often suffer "
@@ -188,53 +187,73 @@ def intro_container():
         "I will focus on the most popular **open-source** speaker diarization libraries. Make sure to check out"
         " [this link](https://wq2012.github.io/awesome-diarization/) for a more exhaustive list on different diarization libraries.")
 
-    container.markdown("### 1. [PyAnnote](https://github.com/pyannote/pyannote-audio)")
+    container.markdown("### 1. [pyannote](https://github.com/pyannote/pyannote-audio)")
     container.markdown(
-        "Arguably one of the most popular libraries out there for speaker diarization.\n"
-        "* Comes with a set of available pre-trained models for the VAD, embedder and segmentation model.\n"
-        "* The inference pipeline can identify multiple speakers speaking at the same time (multi-label diarization).\n"
-
-        "* It is not possible to define the number of speakers for the clustering algorithm. This could lead to an over-estimation or under-estimation of the number of speakers if they are known beforehand.")
-        "* Note that the pre-trained models are based on the [VoxCeleb datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) which consists of recording of celebrities extracted from YouTube. The audio quality of those recordings are crisp and clear, so you might need to retrain your model if you want to tackle other types of data like recorded phone calls.\n"
-
+        "Arguably one of the most popular libraries out there for speaker diarization.\n")
+    container.markdown(
+        "👉 Note that the pre-trained models are based on the [VoxCeleb datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) which consists of recording of celebrities extracted from YouTube. The audio quality of those recordings are crisp and clear, so you might need to retrain your model if you want to tackle other types of data like recorded phone calls.\n")
+    container.markdown(
+        "➕ Comes with a set of available pre-trained models for the VAD, embedder and segmentation model.\n")
+    container.markdown(
+        "➕ The inference pipeline can identify multiple speakers speaking at the same time (multi-label diarization).\n")
+    container.markdown(
+        "➖ It is not possible to define the number of speakers for the clustering algorithm. This could lead to an over-estimation or under-estimation of the number of speakers if they are known beforehand.")
     container.markdown("### 2. [NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo)")
     container.markdown(
-        "The Nvidia NeMo toolkit has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models.\n"
-        "* Diarization results can be combined easily with ASR outputs to generate speaker-aware transcripts.\n"
-        "* Possibility to define the number of speakers beforehand if they are known, resulting in a more accurate diarization output.\n"
-        "* The fact that the NeMo toolkit also includes NLP related frameworks makes it easy to integrate the diarization outcome with downstream NLP tasks.\n"
-        "* The models that the pre-trained networks were trained on were trained on [VoxCeleb datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) as well as the [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) and [SwitchBoard](https://catalog.ldc.upenn.edu/LDC97S62) dataset, which consists of telephone conversations in English. This makes it more suitable as a starting point for fine-tuning a model for call-center use cases compared to the pre-trained models used in PyAnnote. More information about the pre-trained models can be found [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_diarization/results.html).\n"
-    )
+        "The Nvidia NeMo toolkit has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models.\n")
+    container.markdown(
+        "👉 The models that the pre-trained networks were trained on were trained on [VoxCeleb datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) as well as the [Fisher](https://catalog.ldc.upenn.edu/LDC2004T19) and [SwitchBoard](https://catalog.ldc.upenn.edu/LDC97S62) dataset, which consists of telephone conversations in English. This makes it more suitable as a starting point for fine-tuning a model for call-center use cases compared to the pre-trained models used in pyannote. More information about the pre-trained models can be found [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_diarization/results.html).\n")
+    container.markdown(
+        "➕ Diarization results can be combined easily with ASR outputs to generate speaker-aware transcripts.\n")
+    container.markdown(
+        "➕ Possibility to define the number of speakers beforehand if they are known, resulting in a more accurate diarization output.\n")
+    container.markdown(
+        "➕ The fact that the NeMo toolkit also includes NLP related frameworks makes it easy to integrate the diarization outcome with downstream NLP tasks.\n")
     container.markdown("### 3. [Simple Diarizer](https://github.com/cvqluu/simple_diarizer)")
     container.markdown(
-        "A simplified diarization pipeline that can be used for quick testing.\n"
-        "* Uses the same pre-trained models as PyAnnote.\n"
-        "* Unlike PyAnnote, this library does not include the option to fine tune the pre-trained models, making it less suitable for specialized use cases.\n"
-        "* Similarly to Nvidia NeMo, there's the option to define the number of speakers beforehand.\n")
+        "A simplified diarization pipeline that can be used for quick testing.\n")
+    container.markdown(
+        "👉 Uses the same pre-trained models as pyannote.\n")
+    container.markdown(
+        "➕ Similarly to Nvidia NeMo, there's the option to define the number of speakers beforehand.\n")
+    container.markdown(
+        "➖ Unlike pyannote, this library does not include the option to fine tune the pre-trained models, making it less suitable for specialized use cases.\n")
     container.markdown(
         "### 4. [SpeechBrain](https://github.com/speechbrain/speechbrain)")
     container.markdown(
-        "All-in-one conversational AI toolkit based on PyTorch.\n"
-        "* The SpeechBrain Ecosystem makes it easy to develop integrated speech solutions with systems such ASR, speaker identification, speech enhancement, speech separation and language identification.\n"
-        "* Large amount of pre-trained models for various tasks. Checkout their [HuggingFace page](https://huggingface.co/speechbrain) for more information.\n"
-        "* Contains [comprehensible tutorials](https://speechbrain.github.io/tutorial_basics.html) for various speech building blocks to easily get started.\n"
-        "* Diarization pipeline is still not fully implemented yet but this [might change in the future](https://github.com/speechbrain/speechbrain/issues/1208).")
+        "All-in-one conversational AI toolkit based on PyTorch.\n")
+    container.markdown(
+        "➕ The SpeechBrain Ecosystem makes it easy to develop integrated speech solutions with systems such ASR, speaker identification, speech enhancement, speech separation and language identification.\n")
+    container.markdown(
+        "➕ Large amount of pre-trained models for various tasks. Checkout their [HuggingFace page](https://huggingface.co/speechbrain) for more information.\n")
+    container.markdown(
+        "➕ Contains [comprehensible tutorials](https://speechbrain.github.io/tutorial_basics.html) for various speech building blocks to easily get started.\n")
+    container.markdown(
+        "➖ Diarization pipeline is still not fully implemented yet but this [might change in the future](https://github.com/speechbrain/speechbrain/issues/1208).")
     container.markdown(
         "### 5. [Kaldi](https://github.com/kaldi-asr/kaldi)")
     container.markdown(
-        "Speech recognition toolkit that is mainly targeted towards researchers. It is written in C++ and used to train speech recognition models and decode audio from audio files.\n"
-        "* Relatively steep learning curve for beginners who don't have a lot of experience with speech recognition systems.\n"
-        "* Pre-trained model is based on the [CALLHOME](https://catalog.ldc.upenn.edu/LDC97S42) dataset which consists of telephone conversation between native English speakers in North America.\n"
-        "* Not suitable for a quick implementation of ASR/diarization systems. \n"
-        "* Benefits from large community support. However, mainly targeted towards researchers and less suitable for production ready-solutions.\n")
+        "Speech recognition toolkit that is mainly targeted towards researchers. It is written in C++ and used to train speech recognition models and decode audio from audio files.\n")
+    container.markdown(
+        "👉 Pre-trained model is based on the [CALLHOME](https://catalog.ldc.upenn.edu/LDC97S42) dataset which consists of telephone conversation between native English speakers in North America.\n")
+    container.markdown(
+        "👉 Benefits from large community support. However, mainly targeted towards researchers and less suitable for production ready-solutions.\n")
+    container.markdown(
+        "➖ Relatively steep learning curve for beginners who don't have a lot of experience with speech recognition systems.\n")
+    container.markdown(
+        "➖ Not suitable for a quick implementation of ASR/diarization systems. \n")
 
     container.markdown(
         "### 6. [UIS-RNN](https://github.com/google/uis-rnn)")
     container.markdown(
-        "A fully supervised end-to-end diarization model developed by Google.\n"
-        "* No-pretrained model is available, so you need to train it from scratch on your custom transcribed data.\n"
-        "* Relatively easy to train if you have a large set of pre-labeled data.\n"
-        "* Both training and prediction require the usage of a GPU.\n")
+        "A fully supervised end-to-end diarization model developed by Google.\n")
+    container.markdown(
+        "👉 Both training and prediction require the usage of a GPU.\n")
+    container.markdown(
+        "➖ No-pretrained model is available, so you need to train it from scratch on your custom transcribed data.\n")
+    container.markdown(
+        "➕ Relatively easy to train if you have a large set of pre-labeled data.\n")
+    container.markdown("\n")
     container.markdown(
         "Phew 😮‍💨, that's quite some different frameworks! To make it easier to pick the right one for your use case, I've created a simple flowchart that can get you started on picking a suitable library depending on your use case.")
 
@@ -258,10 +277,12 @@ def demo_container(diarization_container):
         "Alright, you're probably very curious at this point to test out a few diarization techniques "
         "yourself. Below is a demo where you can try a few of the libraries that are mentioned above. "
         "You can try running multiple frameworks at the same time and compare their results by ticking multiple "
-        "frameworks and clicking **'Apply'**.")
-
+        "frameworks and clicking **'Apply'**.\n")
+    diarization_container.markdown(
+        "**Note:** We are including **Nemo** and **pyannote** frameworks since we are operating on a single environment and a dependency conflict can occur when including other frameworks (most diarization frameworks rely "
+        "on different and incompatible versions of the same shared packages).")
     diarization_container.caption(
-        "Disclaimer: Keep in mind that due to computational constraints, only the first 30 seconds will be used for diarization when uploading your own recordings. "
+        "**Disclaimer**: Keep in mind that due to computational constraints, only the first 30 seconds will be used for diarization when uploading your own recordings. "
         "For that reason, the diarization results may not be as accurate compared to diarization computed on longer recordings. This"
         " is simply due to the fact that the diarization algorithms will have much less data to sample from in order to create meaningful clusters of embeddings for "
         "each speaker. On the other hand, the diarization results from the provided samples are pre-computed on the whole recording (length of around ≈10min) and thus "
@@ -279,4 +300,4 @@ def conlusion_container():
     container.markdown(
         "👉 Then, we gave a comparison of different diarization frameworks and provided a guide for picking the best one for your use case.")
     container.markdown(
-        "👉 Finally, we provided you with an example to quickly try out a few of the diarization libraries.")
+        "👉 Finally, we provided you with an example to quickly try out a few of the diarization libraries.")
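Review note: the `utils/text_utils.py` change replaces one multi-bullet string per framework with a separate `container.markdown` call per point, each prefixed with a marker (👉 note, ➕ advantage, ➖ drawback). If the list of frameworks keeps growing, the same output could be generated from data. The following is a rough sketch, assuming only that the container exposes a `markdown(str)` method as Streamlit containers do; `render_framework` and `FakeContainer` are hypothetical names for illustration, and the helper always emits notes, then advantages, then drawbacks, which differs slightly from the order used for UIS-RNN in the commit.

```python
from typing import Iterable, List


def render_framework(container, heading: str, summary: str,
                     notes: Iterable[str] = (),
                     pros: Iterable[str] = (),
                     cons: Iterable[str] = ()) -> None:
    """Emit the heading, the summary, then one markdown call per point."""
    container.markdown(heading)
    container.markdown(summary)
    for marker, points in (("👉", notes), ("➕", pros), ("➖", cons)):
        for point in points:
            container.markdown(f"{marker} {point}")


class FakeContainer:
    """Stand-in for a Streamlit container so the sketch runs without Streamlit."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def markdown(self, text: str) -> None:
        self.calls.append(text)


demo = FakeContainer()
render_framework(
    demo,
    "### 3. [Simple Diarizer](https://github.com/cvqluu/simple_diarizer)",
    "A simplified diarization pipeline that can be used for quick testing.",
    notes=["Uses the same pre-trained models as pyannote."],
    pros=["Similarly to Nvidia NeMo, there's the option to define the number of speakers beforehand."],
    cons=["Unlike pyannote, this library does not include the option to fine tune the pre-trained models."],
)
for line in demo.calls:
    print(line)
```

In the app itself, the real Streamlit container would be passed in place of `FakeContainer`.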