---
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false
---

# How to run inference:
1) [huggingface](https://huggingface.co/spaces/speechmaster/denoise_and_diarization)
2) run local inference:
    1) GUI: `python app.py`
    2) Command line: `python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out`

# About the pipeline:
+ denoise the audio
+ VAD (voice activity detection)
+ speaker embeddings for each VAD fragment
+ clustering of these embeddings

# Inference time by hardware

| Hardware                    | Inference time for file dialog.mp3 |
|-----------------------------|:----------------------------------:|
| CPU, 2 vCPU (Hugging Face)  |             453.8 s/it             |
| GPU, Tesla V100             |             8.23 s/it              |

I know many methods for this task:
+ separation: using separation models (these need long training and fine-tuning)
+ diarization
    + speaker embedding + clustering with a known number of speakers
    + overlap speech detection + speaker embedding + clustering with a known number of speakers
    + per-word ASR + speaker embedding + clustering by number of speakers
    + end-to-end neural diarization (state-of-the-art models are still worse than plain diarization)

For this task I used speaker embedding + clustering with an unknown number of speakers.

How I can improve it (I have experience with all of these):
+ preprocessing
    + estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
+ train:
    + custom speaker recognition model
    + custom overlap speech detector
    + custom speech separation model:
        + [MossFormer](https://github.com/alibabasglab/MossFormer)
        + [speechbrain](https://speechbrain.github.io/)
+ use FaceVad if video is available
+ improve speed and RAM usage:
    + model quantization
    + optimize models for the target hardware: ONNX => OpenVINO/TensorRT/Caffe2 or Core ML
    + model pruning
    + distillation (train a small model using a big model)

Further improvements beyond the above:
+ remove overlapped speech using ASR
+ remove overlapped speech using overlap detection
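
The last pipeline step, clustering speaker embeddings without knowing the number of speakers, can be illustrated with a minimal sketch. It uses a greedy cosine-similarity threshold, which is an assumption made for illustration and not necessarily the clustering used in `main_pipeline.py`. The function names `cosine_similarity` and `cluster_embeddings`, the `threshold` value and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.7):
    """Greedy clustering with an unknown number of speakers.

    Each embedding joins the most similar existing cluster centroid,
    or starts a new cluster if no centroid reaches `threshold`.
    Returns one integer speaker label per embedding.
    """
    centroids = []  # running mean embedding per cluster
    counts = []     # number of members per cluster
    labels = []
    for emb in embeddings:
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            # Update the running mean of the matched cluster.
            n = counts[best_idx]
            centroids[best_idx] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best_idx], emb)
            ]
            counts[best_idx] = n + 1
            labels.append(best_idx)
        else:
            # No cluster is close enough: treat this as a new speaker.
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" for four VAD fragments (hypothetical values).
    fragments = [
        [1.0, 0.1, 0.0],
        [0.0, 1.0, 0.1],
        [0.9, 0.2, 0.0],
        [0.1, 0.9, 0.0],
    ]
    print(cluster_embeddings(fragments))
```

The number of speakers is never passed in: it falls out of how many clusters the threshold creates. A lower `threshold` merges more fragments into one speaker, a higher one splits them, so in practice it would need tuning on real embeddings.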