Last commit d7280b3 ("fix") by Artem Gorlanov, about 1 year ago


	---
	title: Denoise And Diarization
	emoji: 🐠
	colorFrom: gray
	colorTo: gray
	sdk: gradio
	sdk_version: 3.28.0
	app_file: app.py
	pinned: false
	---

	# How to run:
	1) On [Hugging Face](https://huggingface.co/spaces/speechmaster/denoise_and_diarization)
	2) Locally:
	    1) GUI:
	    `python app.py`
	    2) Command-line inference:
	    `python main_pipeline.py --audio-path dialog.mp3 --out-folder-path out`
	3) With Docker:
	```
	docker login registry.hf.space
	docker run -it -p 7860:7860 --platform=linux/amd64 \
	registry.hf.space/speechmaster-denoise-and-diarization:latest python app.py
	```

	# About the pipeline:
	+ denoise the audio
	+ VAD (voice activity detection)
	+ extract a speaker embedding from each VAD fragment
	+ cluster these embeddings
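
To make the VAD step concrete, here is a simple energy-threshold detector. It is a minimal sketch, not the VAD model this pipeline actually uses. `energy_vad`, the 30 ms frame size, the -35 dBFS threshold and the gap-merging rule are all hypothetical choices, and the input is assumed to be mono float samples in [-1, 1].

```python
import math


def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0, min_gap_frames=3):
    """Return speech segments as (start_s, end_s) tuples using an RMS-energy threshold.

    Hypothetical stand-in for the VAD stage. Samples are assumed to be floats
    in [-1, 1]; a frame counts as speech when its RMS level in dBFS exceeds
    `threshold_db`. Speech runs separated by fewer than `min_gap_frames`
    silent frames are merged into one segment.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else -math.inf
        flags.append(level_db > threshold_db)

    segments = []
    seg_start = None  # index of the first frame of the current segment
    last_speech = 0   # index of the most recent speech frame
    gap = 0           # silent frames seen since last_speech
    for i, is_speech in enumerate(flags):
        if is_speech:
            if seg_start is None:
                seg_start = i
            last_speech = i
            gap = 0
        elif seg_start is not None:
            gap += 1
            if gap >= min_gap_frames:
                segments.append((seg_start, last_speech + 1))
                seg_start = None
                gap = 0
    if seg_start is not None:
        segments.append((seg_start, last_speech + 1))

    # Convert frame indices to seconds.
    frame_s = frame_len / sample_rate
    return [(s * frame_s, e * frame_s) for s, e in segments]
```

A learned VAD is far more robust to noise than a fixed energy threshold, which is one reason the pipeline denoises the audio first.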


	# Inference time by hardware

	| Hardware | Inference time for `dialog.mp3` |
	|-----------------------|:------------------------------------:|
	| CPU: 2 vCPU (Hugging Face Space) | 453.8 s/it |
	| GPU: Tesla V100 | 8.23 s/it |
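
Comparing the two rows, the Tesla V100 run on `dialog.mp3` is about 55 times faster than the 2 vCPU run:

```latex
\text{speedup} = \frac{453.8\ \text{s}}{8.23\ \text{s}} \approx 55.1
```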

	# Approaches
	I know of many methods for this task:
	+ separation: using speech separation models (requires long training and fine-tuning)
	+ diarization:
	    + speaker embedding + clustering with a known number of speakers
	    + overlapped speech detection
	    + speaker embedding + clustering with an unknown number of speakers
	    + ASR for each word + speaker embedding + clustering
	+ end-to-end neural diarization (even the state of the art is worse than plain diarization)

	For this task I used speaker embedding + clustering with an unknown number of speakers.
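
One common way to cluster with an unknown number of speakers is agglomerative clustering that stops merging once the closest clusters are farther apart than a distance threshold. Below is a minimal pure-Python sketch, assuming each VAD fragment already has a fixed-size, non-zero embedding vector. `cluster_speakers`, average linkage and the 0.5 cosine-distance threshold are hypothetical choices, not taken from this repository.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_speakers(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering with a distance threshold.

    Merging stops when the closest pair of clusters is farther apart than
    `threshold`, so the number of speakers is not fixed in advance.
    Returns one integer speaker label per embedding.
    """
    n = len(embeddings)
    dist = [[cosine_distance(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are treated as distinct speakers
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

The threshold has to be tuned on held-out data for the chosen embedding model. scikit-learn's `AgglomerativeClustering` with `n_clusters=None` and `distance_threshold` provides the same behavior more efficiently.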


	# How I can improve it:
	+ Fix preprocessing
	+ Estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
	+ Train:
	    + a custom speaker recognition model
	    + a custom overlapped speech detector
	    + a custom speech separation model:
	        + [MossFormer](https://github.com/alibabasglab/MossFormer)
	        + [speechbrain](https://speechbrain.github.io/)
	+ Use FaceVad (face-based voice activity detection) if video is available
	+ Improve speed and RAM usage:
	    + quantize the models
	    + optimize the models for the target hardware: ONNX => OpenVINO / TensorRT / Caffe2 or Core ML
	    + prune the models
	    + distillation (train a small model using a big one)
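
The idea of estimating the SNR and skipping denoising on clean input could look roughly like the following hypothetical sketch. It reuses VAD segments, treats non-speech regions as pure noise and speech regions as speech plus that same noise. `estimate_snr_db`, `should_denoise` and the 20 dB threshold are assumptions for illustration only.

```python
import math


def estimate_snr_db(samples, speech_segments, sample_rate):
    """Rough SNR estimate in dB from VAD output.

    Assumes non-speech regions contain only background noise and speech
    regions contain speech plus the same noise, so
    SNR = 10 * log10((P_speech_region - P_noise) / P_noise).
    Returns None when either kind of region is missing.
    """
    in_speech = [False] * len(samples)
    for start_s, end_s in speech_segments:
        first = max(int(start_s * sample_rate), 0)
        last = min(int(end_s * sample_rate), len(samples))
        for i in range(first, last):
            in_speech[i] = True

    speech_power = [x * x for x, s in zip(samples, in_speech) if s]
    noise_power = [x * x for x, s in zip(samples, in_speech) if not s]
    if not speech_power or not noise_power:
        return None

    p_total = sum(speech_power) / len(speech_power)
    p_noise = sum(noise_power) / len(noise_power)
    if p_noise == 0:
        return math.inf
    p_signal = max(p_total - p_noise, 1e-12)  # clamp to avoid log10 of <= 0
    return 10 * math.log10(p_signal / p_noise)


def should_denoise(snr_db, threshold_db=20.0):
    """Denoise unless the estimate says the input is already clean."""
    return snr_db is None or snr_db < threshold_db
```

When no estimate is possible the gate falls back to denoising, which keeps the current behavior as the safe default.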



	Other ways to improve, beyond those above:
	+ remove overlapping speech using ASR
	+ remove overlapping speech using an overlapped speech detector
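
Removing overlapping speech with a detector amounts to subtracting the detected overlap intervals from the VAD segments before speaker embeddings are extracted, so each embedding comes from single-speaker audio. A minimal sketch, assuming both inputs are lists of (start, end) times in seconds; `subtract_intervals` and `min_duration` are hypothetical names.

```python
def subtract_intervals(segments, overlaps, min_duration=0.0):
    """Remove overlapped-speech regions from speech segments.

    segments, overlaps: lists of (start_s, end_s) tuples. Returns the parts
    of `segments` not covered by any interval in `overlaps`, dropping pieces
    shorter than `min_duration` (too short for a reliable embedding).
    """
    result = []
    for seg_start, seg_end in segments:
        pieces = [(seg_start, seg_end)]
        for ov_start, ov_end in overlaps:
            next_pieces = []
            for p_start, p_end in pieces:
                if ov_start > p_start:
                    # part of the piece before the overlap starts
                    next_pieces.append((p_start, min(p_end, ov_start)))
                if ov_end < p_end:
                    # part of the piece after the overlap ends
                    next_pieces.append((max(p_start, ov_end), p_end))
            pieces = next_pieces
        result.extend(p for p in pieces if p[1] - p[0] >= min_duration)
    return result
```

For example, `subtract_intervals([(0.0, 10.0)], [(3.0, 5.0)])` keeps `(0.0, 3.0)` and `(5.0, 10.0)`.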