---
title: Denoise And Diarization
emoji: ๐
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false
---

# How to run inference:
1) [Hugging Face Space](https://huggingface.co/spaces/deepkotix/denoise_and_diarization)
2) [Telegram bot](https://t.me/diarizarion_bot)
3) Run inference locally:
    1) GUI:
       `python app.py`
    2) Command line:
       `python main_pipeline.py --audio-path dialog.mp3`

# About pipeline:
+ denoise the audio
+ VAD (voice activity detection)
+ extract a speaker embedding from each VAD fragment
+ cluster these embeddings
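
To make the VAD stage concrete, here is a deliberately simple energy-based detector. It is only a sketch: the pipeline's actual VAD is not described in this README and is most likely a neural model. The function name `energy_vad`, the 25 ms frame length, and the -35 dB threshold are all assumptions chosen for illustration.

```python
import math


def energy_vad(samples, sample_rate, frame_ms=25, threshold_db=-35.0):
    """Return (start_sec, end_sec) speech segments from mono float samples in [-1, 1].

    Hypothetical stand-in for a real VAD: a frame counts as speech when its
    RMS level (in dB relative to full scale) exceeds `threshold_db`.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    start = None  # start time of the currently open speech segment, if any

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        level_db = 20.0 * math.log10(rms) if rms > 0 else float("-inf")
        is_speech = level_db > threshold_db
        t = offset / sample_rate

        if is_speech and start is None:
            start = t                    # open a new segment
        elif not is_speech and start is not None:
            segments.append((start, t))  # close the segment at this frame
            start = None

    if start is not None:                # audio ended while still in speech
        segments.append((start, len(samples) / sample_rate))
    return segments


# Toy signal: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
signal = [0.0] * sr + tone + [0.0] * sr
print(energy_vad(signal, sr))
```

Each returned segment would then be passed to the speaker embedding model. An energy threshold breaks down on noisy input, which is one reason the denoising step runs first.
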

# Inference time by hardware
| Hardware | Inference time for file dialog.mp3 |
|-----------------------|:------------------------------------:|
| CPU (2 vCPU, Hugging Face) | 453.8 s/it |
| GPU (Tesla V100) | 8.23 s/it |


I know a lot of methods for this task:
+ separation: using separation models (needs long training and fine-tuning)
+ diarization
    + speaker embedding + clustering with a known number of speakers
    + overlap speech detection
        + speaker embedding + clustering with a known number of speakers
    + ASR for each word + speaker embedding + clustering by number of speakers
+ end-to-end neural diarization (the state of the art is worse than plain diarization)

For this task I used speaker embedding + clustering with an unknown number of speakers.
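
Clustering with an unknown number of speakers can be sketched as average-linkage agglomerative clustering that stops at a distance threshold instead of a target cluster count. This is a minimal illustration, not the code from `main_pipeline.py`: the function names, the cosine distance metric, and the 0.5 threshold are assumptions, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_speakers(embeddings, distance_threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per embedding.

    The number of speakers is not given: merging stops when the closest
    pair of clusters is farther apart than `distance_threshold`.
    """
    clusters = [[i] for i in range(len(embeddings))]  # each fragment starts alone

    def linkage(c1, c2):
        # Average pairwise cosine distance between members of two clusters.
        total = sum(cosine_distance(embeddings[i], embeddings[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            ((linkage(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > distance_threshold:
            break  # remaining clusters are treated as different speakers
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 3-dimensional "embeddings": two fragments per speaker, interleaved.
fragments = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.9, 0.2, 0.0],
    [0.1, 0.9, 0.0],
]
labels = cluster_speakers(fragments)
print(labels, "speakers found:", len(set(labels)))
```

The threshold replaces the speaker count as the tuning knob: too low and one speaker is split into several clusters, too high and different speakers are merged.
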

How I can improve it (I have experience with these):
+ preprocessing
    + estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
+ train:
    + a custom speaker recognition model
    + a custom overlap speech detector
    + a custom speech separation model:
        + [MossFormer](https://github.com/alibabasglab/MossFormer)
        + [speechbrain](https://speechbrain.github.io/)
+ use FaceVad if video is available
+ improve speed and RAM usage:
    + quantize models
    + optimize models for the target hardware: ONNX => OpenVINO/TensorRT/Caffe2 or Core ML
    + prune models
    + distillation (train a small model with a big model)
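
The SNR check from the preprocessing item could look roughly like the sketch below. It is a crude heuristic under stated assumptions, not an existing part of the pipeline: the quietest 10% of frames are assumed to contain only background noise, and the 30 dB "clean" threshold, the frame length, and the function names are hypothetical.

```python
import math
import random


def estimate_snr_db(samples, frame_len=400, noise_quantile=0.1):
    """Crude SNR estimate in dB from mono float samples.

    Assumption: the quietest `noise_quantile` of frames hold only background
    noise, and the remaining frames hold signal plus noise.
    """
    powers = []
    for offset in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[offset:offset + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    if not powers:
        raise ValueError("audio is shorter than one frame")
    powers.sort()

    n_noise = max(1, int(len(powers) * noise_quantile))
    noise_power = sum(powers[:n_noise]) / n_noise
    rest = powers[n_noise:] or powers
    # Subtract the noise floor from the average power of the other frames.
    signal_power = max(sum(rest) / len(rest) - noise_power, 0.0)

    if noise_power == 0.0:
        return float("inf")    # no measurable noise floor
    if signal_power == 0.0:
        return float("-inf")   # nothing above the noise floor
    return 10.0 * math.log10(signal_power / noise_power)


def should_denoise(samples, clean_threshold_db=30.0):
    """Run the denoiser only when the estimated SNR is below the threshold."""
    return estimate_snr_db(samples) < clean_threshold_db


rng = random.Random(0)
sr = 16000


def make_signal(noise_std):
    # 1 s of noise, 1 s of a 440 Hz tone plus noise, 1 s of noise.
    out = []
    for n in range(3 * sr):
        tone = 0.5 * math.sin(2 * math.pi * 440 * n / sr) if sr <= n < 2 * sr else 0.0
        out.append(tone + rng.gauss(0.0, noise_std))
    return out


noisy = make_signal(0.05)
clean = make_signal(0.0001)
print(should_denoise(noisy), should_denoise(clean))
```

Because pauses are averaged in with speech, this estimate is biased low. For gating the denoiser that bias is on the safe side, since borderline files still get denoised.
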

Other ways to improve, besides the above:
+ remove overlapping speech using ASR