Diarization trouble

#1
by nexuslux - opened

Hey! Thanks for your help on this project, very useful to me. I got it working locally but struggle with the diarization.

I've done the following: edited the config.json5 file with diarization set to true.
I've created a .env file with my Hugging Face access token.
I've tested it without diarization and it works great.

So it seems I just struggle a bit to get this working locally. Based on the readme instructions for diarization - does that mean I need to move the app.py file from the home directory to the /src/diarization directory and re-run the app? And what about the part about "Download the JSON file after running Whisper on an audio file" -- sorry. Just that I'm really struggling :P

This is the error:

Screenshot 2024-01-24 at 14.35.04.png

Diarization
To detect different speakers in the audio, you can use the whisper-diarization application.

Download the JSON file after running Whisper on an audio file, and then run app.py in the whisper-diarization repository with the audio file and the JSON file as arguments.

Hi nexuslux!

Originally, I thought the issue might be due to uninitialized diarization, but since you have set the access token, it should be fine. I suspect the problem might be related to the execution environment. Do you have other projects sharing the same environment under envs? You can try running "pip install -r requirements.txt -U" under the "whisper-webui-translate" directory.

Also, "app.py" doesn't need to be moved to the "/src/diarization" directory. Under normal circumstances, it should run correctly without needing to move it.
[Image: image.png]

I tested the environment using Python 3.10, although 3.11 should also be okay...

I hope you can run it successfully. Thank you.

Thanks for your reply! I'm a novice when it comes to Python, so apologies if this is a really simple error and I'm just missing it. I'm running this on an M1 Mac, using VS Code as my IDE and Chrome as the browser.

Step 1: I created a new folder and cloned the repository into it: git clone "https://huggingface.co/spaces/avans06/whisper-webui-translate"
Step 2: That gives me the directory structure desktop/whisperwebuitranslate/whisper-webui-translate/
Step 3: I created a new environment in the outer folder 'whisperwebuitranslate': conda create --name whisperwebuitranslate python=3.10
Step 4: I installed all the packages from the requirements.txt in the main directory, as well as from the requirements.txt in the /src/diarization folder.
Step 5: Run the app: python app.py --input_audio_max_duration -1
Step 6: Test with a short .m4a file. This works fine and the transcription completes.
Step 7: Create a .env file at /desktop/whisperwebuitranslate/whisper-webui-translate/.env. It contains just HK_ACCESS_TOKEN="xxx" (with my token inside the quotes).
Step 8: Change the config.json5 file to include my access token as well.
Step 9: Run the app again: python app.py --input_audio_max_duration -1
Step 10: Transcribe the same file as before, this time with diarization on, and then I get the error.
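
Steps 7 and 8 depend on the token actually being readable under the name the app expects. A minimal sketch for checking that from Python, using only the standard library, is below. The helper names read_dotenv and find_token are hypothetical, and the variable name is passed in as an argument because the thread does not confirm which name the app reads; check the README for the exact spelling.

```python
import os
from pathlib import Path


def read_dotenv(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything without an '=' sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def find_token(dotenv_path, name):
    """Return the token from the process environment, else from the .env file."""
    token = os.environ.get(name)
    if token:
        return token
    if Path(dotenv_path).is_file():
        return read_dotenv(dotenv_path).get(name) or None
    return None
```

Calling find_token(".env", "NAME_FROM_README") from the repository folder and getting None back would suggest the name in the .env file does not match, or the file is not where the app looks for it.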

Error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarizationContainer.py", line 58, in execute
result = list(model.run(audio_file, **kwargs))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarization.py", line 86, in run
diarization = self.pipeline(target_file, **kwargs)
TypeError: 'NoneType' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/app.py", line 372, in transcribe_webui
result = self.transcribe_file(model, source.source_path, whisperLangCode, task, vadOptions, scaled_progress_listener, **decodeOptions)
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/app.py", line 552, in transcribe_file
diarization_result = list(self.diarization.run(audio_path, **self.diarization_kwargs))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarizationContainer.py", line 30, in run
result = pool.apply(self.execute, (audio_file,), kwargs)
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
TypeError: 'NoneType' object is not callable

And in the terminal I have the following:
(whisperwebuitranslate) (base) tristan@Tristans-MBA whisper-webui-translate % python app.py --input_audio_max_duration -1
Using whisper implementation: faster-whisper
/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
Queue mode enabled (concurrency count: 1)
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Creating whisper container for faster-whisper
Using cache found in /Users/tristan/.cache/torch/hub/snakers4_silero-vad_master
Created Silerio model
Getting timestamps from audio file: /var/folders/gm/5f3_ryvd7lxbdw2r39kh54gw0000gp/T/gradio/904f8e8475a61e12b364d5ac4bda825ae41ce88c/weishantest.m4a, start: 0, duration: 298.581333
Processing VAD in chunk from 00:00.000 to 04:58.581
VAD processing took 12.171367542003281 seconds
Transcribing non-speech:
[{'end': 88.986, 'expand_amount': 0.8200000000000074, 'start': 0.0},
{'end': 156.176, 'expand_amount': 0.0, 'start': 88.986},
{'end': 208.0, 'expand_amount': 0.0, 'start': 156.176},
{'end': 274.432, 'expand_amount': 0.0, 'start': 208.0},
{'end': 299.542625, 'start': 274.432}]
Processing timestamps:
[{'end': 88.986, 'expand_amount': 0.8200000000000074, 'start': 0.0},
{'end': 156.176, 'expand_amount': 0.0, 'start': 88.986},
{'end': 208.0, 'expand_amount': 0.0, 'start': 156.176},
{'end': 274.432, 'expand_amount': 0.0, 'start': 208.0},
{'end': 299.542625, 'start': 274.432}]
Running whisper 0: from 00:00.000 to 01:28.986 , duration: 88.986 expanded: 0.8200000000000074 , prompt: None , detected language: None
Loading faster whisper model tiny for device None
Whisper took 4.971264166990295 seconds
Running whisper 1: from 01:28.986 to 02:36.176 , duration: 67.18999999999998 expanded: 0.0 , prompt: None , detected language: en
Whisper took 2.3218839580076747 seconds
Running whisper 2: from 02:36.176 to 03:28.000 , duration: 51.82400000000001 expanded: 0.0 , prompt: None , detected language: en
Whisper took 2.0640250829746947 seconds
Running whisper 3: from 03:28.000 to 04:34.432 , duration: 66.43200000000002 expanded: 0.0 , prompt: Yeah, yeah, yeah, yeah. Yeah, all you can have that in. , detected language: en
Whisper took 2.6510332920006476 seconds
Running whisper 4: from 04:34.432 to 04:59.543 , duration: 25.11062499999997 expanded: 0 , prompt: None , detected language: en
Whisper took 1.2385675830300897 seconds
Diarizing /var/folders/gm/5f3_ryvd7lxbdw2r39kh54gw0000gp/T/gradio/904f8e8475a61e12b364d5ac4bda825ae41ce88c/weishantest.m4a
Created diarization context with auto cleanup timeout of 60 seconds
Loading diarization model from cache
/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")

Could not download 'pyannote/speaker-diarization-3.0' pipeline.
It might be because the pipeline is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

Pipeline.from_pretrained('pyannote/speaker-diarization-3.0',
... use_auth_token=YOUR_AUTH_TOKEN)

If this still does not work, it might be because the pipeline is gated:
visit https://hf.co/pyannote/speaker-diarization-3.0 to accept the user conditions.
Diarization - using CPU
ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
built with Apple clang version 15.0.0 (clang-1500.1.0.2.5)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/6.1.1_2 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopenvino --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox --enable-neon
libavutil 58. 29.100 / 58. 29.100
libavcodec 60. 31.102 / 60. 31.102
libavformat 60. 16.100 / 60. 16.100
libavdevice 60. 3.100 / 60. 3.100
libavfilter 9. 12.100 / 9. 12.100
libswscale 7. 5.100 / 7. 5.100
libswresample 4. 12.100 / 4. 12.100
libpostproc 57. 3.100 / 57. 3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/var/folders/gm/5f3_ryvd7lxbdw2r39kh54gw0000gp/T/gradio/904f8e8475a61e12b364d5ac4bda825ae41ce88c/weishantest.m4a':
Metadata:
major_brand : M4A
minor_version : 0
compatible_brands: M4A isommp42
creation_time : 2024-01-19T04:59:51.000000Z
title : Chat with Weishan Quant Youth Tracker
iTunSMPB : 00000000 00000742 0000018F 0000000000DAA72F 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
voice-memo-uuid : A60595FC-B7EF-4E38-B566-77DCD75B6195
encoder : com.apple.VoiceMemos (iPad Version 14.0 (Build 23A344))
Duration: 00:04:58.58, start: 0.038708, bitrate: 64 kb/s
Stream #0:00x1: Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, mono, fltp, 62 kb/s (default)
Metadata:
creation_time : 2024-01-19T04:59:51.000000Z
handler_name : Core Media Audio
vendor_id : [0][0][0][0]
Stream mapping:
Stream #0:0 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to '/var/folders/gm/5f3_ryvd7lxbdw2r39kh54gw0000gp/T/diarization_32udb_yi.wav':
Metadata:
major_brand : M4A
minor_version : 0
compatible_brands: M4A isommp42
voice-memo-uuid : A60595FC-B7EF-4E38-B566-77DCD75B6195
INAM : Chat with Weishan Quant Youth Tracker
iTunSMPB : 00000000 00000742 0000018F 0000000000DAA72F 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
ISFT : Lavf60.16.100
Stream #0:0(und): Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s (default)
Metadata:
creation_time : 2024-01-19T04:59:51.000000Z
handler_name : Core Media Audio
vendor_id : [0][0][0][0]
encoder : Lavc60.31.102 pcm_s16le
[out#0/wav @ 0x14ee398d0] video:0kB audio:27988kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000433%
size= 27988kB time=00:04:58.52 bitrate= 768.1kbits/s speed=1.17e+03x
Started auto cleanup of pool in 60 seconds
Deleting temporary source file: /var/folders/gm/5f3_ryvd7lxbdw2r39kh54gw0000gp/T/gradio/904f8e8475a61e12b364d5ac4bda825ae41ce88c/weishantest.m4a
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarizationContainer.py", line 58, in execute
result = list(model.run(audio_file, **kwargs))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarization.py", line 86, in run
diarization = self.pipeline(target_file, **kwargs)
TypeError: 'NoneType' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/app.py", line 372, in transcribe_webui
result = self.transcribe_file(model, source.source_path, whisperLangCode, task, vadOptions, scaled_progress_listener, **decodeOptions)
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/app.py", line 552, in transcribe_file
diarization_result = list(self.diarization.run(audio_path, **self.diarization_kwargs))
File "/Users/tristan/Desktop/Whisperwebuitest/whisper-webui-translate/src/diarization/diarizationContainer.py", line 30, in run
result = pool.apply(self.execute, (audio_file,), kwargs)
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 360, in apply
return self.apply_async(func, args, kwds).get()
File "/Users/tristan/miniconda3/envs/whisperwebuitranslate/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
TypeError: 'NoneType' object is not callable

Executing cleanup of pool
Stopped auto cleanup of pool
Closing pool of 1 processes

I think the key part is here:

Could not download 'pyannote/speaker-diarization-3.0' pipeline.
It might be because the pipeline is private or gated so make
sure to authenticate. Visit https://hf.co/settings/tokens to
create your access token and retry with:

Pipeline.from_pretrained('pyannote/speaker-diarization-3.0',
... use_auth_token=YOUR_AUTH_TOKEN)

But I have already put my auth token in...?

Never mind! This was a token issue in the end. I had accepted the conditions for three different versions of pyannote, but the one that is actually being called is this one:
https://huggingface.co/pyannote/speaker-diarization-3.0

Cheers!

I apologize for the confusion. In the project, I have set the version to speaker-diarization-3.0, not the original version.
Once you accept version 3.0, it should run smoothly.

Cheers!
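
For anyone reading later: the log above suggests that when the download is refused, the loader prints the "Could not download" message and hands back nothing, so the failure only shows up afterwards as TypeError: 'NoneType' object is not callable. A small guard like the sketch below would surface the cause at load time instead. The function name load_pipeline_or_fail is hypothetical and is not part of this project or of pyannote; the loader is passed in so the guard itself has no dependencies.

```python
def load_pipeline_or_fail(loader, model_id, token):
    """Call loader(model_id, use_auth_token=token) and reject a None result.

    `loader` is any callable with that shape, for example
    pyannote.audio's Pipeline.from_pretrained as shown in the log message.
    """
    pipeline = loader(model_id, use_auth_token=token)
    if pipeline is None:
        # Fail here with a readable message rather than later with
        # "'NoneType' object is not callable".
        raise RuntimeError(
            f"Could not load '{model_id}'. Check that the access token is set "
            f"and that the user conditions for this model (and the models it "
            f"depends on) have been accepted on Hugging Face."
        )
    return pipeline


# Intended use (assumes pyannote.audio is installed):
# from pyannote.audio import Pipeline
# pipeline = load_pipeline_or_fail(
#     Pipeline.from_pretrained, "pyannote/speaker-diarization-3.0", token
# )
```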

Not at all! I should have accepted all of them :)

Hey,
Thanks for the discussion, it was super helpful for figuring out the issue. In case someone else ends up here like me: you also need to accept the conditions for https://huggingface.co/pyannote/segmentation-3.0

Comments from the diarization source

        """
        1. Install pyannote.audio 3.0 with pip install pyannote.audio
        2. Accept pyannote/segmentation-3.0 user conditions
        3. Accept pyannote/speaker-diarization-3.0 user conditions
        4. Create access token at hf.co/settings/tokens.
        https://huggingface.co/pyannote/speaker-diarization-3.0
        """
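
Steps 2 and 3 of that comment can be checked before launching the app. The sketch below is one possible way to do it with the standard library only: it requests a small file from each gated repository with the token attached and treats anything other than HTTP 200 as "no access yet". It assumes each repository exposes a config.yaml on its main branch and that the Hub refuses a gated file with a non-200 status until the conditions are accepted; both are assumptions, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# The two gated repositories named in the comment above.
REQUIRED_REPOS = (
    "pyannote/segmentation-3.0",
    "pyannote/speaker-diarization-3.0",
)


def probe_url(repo_id, filename="config.yaml"):
    """Build the URL of a small file in the repo (filename is an assumption)."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def http_status(url, token):
    """Return the HTTP status code for a GET request sent with the token."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def check_access(token, repos=REQUIRED_REPOS, get_status=http_status):
    """Map each repo id to True if the probe file is readable with the token."""
    return {repo: get_status(probe_url(repo), token) == 200 for repo in repos}
```

A result such as {"pyannote/segmentation-3.0": False, ...} would point to the repository whose conditions still need to be accepted on its model page.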
