metadata
title: Denoise And Diarization
emoji: 🐠
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 3.28.0
app_file: app.py
pinned: false

How to run inference:

  1. Hugging Face Space
  2. Telegram bot
  3. Run locally:
    1. GUI: python app.py
    2. Command line: python main_pipeline.py --audio-path dialog.mp3

About the pipeline:

  • denoise the audio
  • run VAD (voice activity detection)
  • extract a speaker embedding from each VAD fragment
  • cluster these embeddings
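To make the VAD step concrete, here is a minimal sketch of how audio can be split into speech fragments. It is not the detector used in this repo: it assumes plain frame-energy thresholding as a stand-in, and the function name `energy_vad`, the frame length, and the threshold are all hypothetical.

```python
def energy_vad(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start_sec, end_sec) speech fragments.

    Assumption: a frame counts as speech when its mean squared
    amplitude exceeds `threshold`. A real VAD model is more robust.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    fragments = []
    start = None  # start time of the currently open fragment, if any
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        t = i * frame_len / sample_rate
        if energy > threshold and start is None:
            start = t  # speech begins
        elif energy <= threshold and start is not None:
            fragments.append((start, t))  # speech ends
            start = None
    if start is not None:
        # The audio ended while speech was still active.
        fragments.append((start, n_frames * frame_len / sample_rate))
    return fragments


# Toy signal: 0.2 s silence, 0.3 s "speech", 0.1 s silence at 1 kHz.
toy = [0.0] * 200 + [0.5] * 300 + [0.0] * 100
print(energy_vad(toy, sample_rate=1000, frame_ms=100))
```

Each returned fragment would then be passed to the speaker embedding model.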

Inference time by hardware

Inference time for the file dialog.mp3:
CPU (2 vCPU, Hugging Face): 453.8 s/it
GPU (Tesla V100): 8.23 s/it
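For a quick comparison, the ratio of the two timings can be computed directly. The variable names are only illustrative; the numbers are the ones reported above.

```python
cpu_s_per_it = 453.8  # 2 vCPU on Hugging Face
gpu_s_per_it = 8.23   # Tesla V100

speedup = cpu_s_per_it / gpu_s_per_it
print(f"GPU speedup over CPU: {speedup:.1f}x")  # prints "GPU speedup over CPU: 55.1x"
```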

I know several approaches to this task:

  • separation: using speech separation models (these need long training and fine-tuning)
  • diarization
    • speaker_embedding+Clustering with a known number of speakers
    • overlapped speech detection
    • asr_each_word+speaker_embedding+Clustering numbers of speakers
  • end-to-end neural diarization (the current state of the art is worse than plain diarization)

For this task I used speaker_embedding+Clustering with an unknown number of speakers.
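When the number of speakers is unknown, the clustering step cannot be given a fixed cluster count, so a similarity threshold decides when a new speaker appears. The sketch below illustrates the idea with a simple greedy threshold clustering in pure Python. It is an assumption for illustration, not the algorithm used in this repo; `cluster_embeddings` and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.7):
    """Assign a speaker label to each embedding.

    Assumption: an embedding joins the most similar existing cluster
    if the cosine similarity is at least `threshold`; otherwise it
    starts a new cluster (a new speaker).
    """
    sums = []    # per-cluster sum of member vectors (same direction as the mean)
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, vec in enumerate(sums):
            sim = cosine_similarity(emb, vec)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            sums.append(list(emb))
            labels.append(len(sums) - 1)
        else:
            sums[best] = [s + e for s, e in zip(sums[best], emb)]
            labels.append(best)
    return labels


# Toy 2-D "embeddings" from two speakers.
toy = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]]
print(cluster_embeddings(toy))  # [0, 0, 1, 1, 0]
```

The threshold trades off splitting one speaker into several clusters against merging different speakers, so it would need tuning on real embeddings.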

How I can improve this (I have experience with all of these):

  • preprocessing
    • estimate the SNR (signal-to-noise ratio) and skip denoising if the input is already clean
  • train:
    • custom speaker recognition model
    • custom overlapped speech detector
    • custom speech separation model
  • use FaceVad if video is available
  • improve speed and RAM usage:
    • quantize models
    • optimize models for the target hardware: onnx => openvino/tensorrt/caffe2 or coreml
    • prune models
    • distillation (train a small model with a big model)
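The SNR check from the preprocessing item could look roughly like the sketch below. It assumes the quietest frames of the recording contain only noise, which is a crude heuristic; the function names, the frame length, and the 20 dB cutoff are hypothetical and not taken from this repo.

```python
import math


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Rough SNR estimate in dB from per-frame energies.

    Assumption: the quietest `noise_fraction` of frames is noise and
    the remaining frames are signal (plus noise).
    """
    energies = sorted(
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    if len(energies) < 2:
        raise ValueError("need at least two full frames")
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise_power = sum(energies[:n_noise]) / n_noise
    signal_power = sum(energies[n_noise:]) / (len(energies) - n_noise)
    if noise_power == 0:
        return math.inf  # no measurable noise floor
    return 10 * math.log10(signal_power / noise_power)


def should_denoise(samples, threshold_db=20.0, **kwargs):
    """Run the denoiser only when the estimated SNR is below the cutoff."""
    return estimate_snr_db(samples, **kwargs) < threshold_db


# Toy signal: one quiet frame followed by nine loud frames.
toy = [0.1, -0.1, 0.1, -0.1] + [1.0, -1.0, 1.0, -1.0] * 9
print(round(estimate_snr_db(toy, frame_len=4), 1))
print(should_denoise(toy, threshold_db=30.0, frame_len=4))
```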

Further improvements beyond the above:

  • remove overlapping speech using ASR