---
license: mit
pipeline_tag: feature-extraction
tags:
  - bark
  - tts
  - hubert
  - text-to-speech
datasets:
  - GitMylo/bark-semantic-training
---

# Bark-voice-cloning

Bark-voice-cloning is a model that processes the output of a HuBERT model and turns it into semantic tokens compatible with bark text-to-speech.

This can be used for many things, including speech transfer and voice cloning.

The training and inference code is available in the code repo.

## The models in this repo

(Please use the model manager from the code repo for easy downloading of models.)

## Voice cloning

Voice cloning is creating a new voice for text-to-speech.

Process:

  1. Load your wav audio file into your PyTorch application.
  2. For the fine prompt, extract discrete representations (these are the codes bark uses to reproduce the voice). Make sure to `.squeeze()` the resulting codes.
  3. For the coarse prompt, take `fine_prompt[:2, :]` to derive the coarse prompt from the fine prompt.
  4. For the semantics, load a HuBERT model without k-means (I personally use the audiolm-pytorch implementation's `HubertWithKmeans`, but edited to skip the k-means step).
  5. Next, run the HuBERT output through this model to get the actual semantic tokens. The output is compatible with bark.
  6. Save these arrays in an npz with `numpy.savez(file, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse)`. This is your speaker file containing your cloned voice.
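The steps above can be sketched roughly as follows. This is a hedged sketch, not the repo's exact code: it assumes bark, encodec, torchaudio, and the companion HuBERT quantizer code are installed, and the module layout (`hubert.pre_kmeans_hubert`, `hubert.customtokenizer`) and checkpoint filenames are illustrative.

```python
import numpy as np

def build_speaker_file(wav_path, out_path="speaker.npz"):
    """Sketch of the voice-cloning steps above. Assumes bark and the
    HuBERT quantizer from the code repo are installed; module and
    checkpoint names here are illustrative, not guaranteed."""
    import torch
    import torchaudio
    from bark.generation import load_codec_model
    from encodec.utils import convert_audio
    from hubert.pre_kmeans_hubert import CustomHubert    # assumed module layout
    from hubert.customtokenizer import CustomTokenizer   # from the code repo

    # 1. Load the wav and resample to what bark's codec expects
    wav, sr = torchaudio.load(wav_path)
    codec = load_codec_model(use_gpu=False)
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)

    # 2. Fine prompt: discrete EnCodec codes, squeezed to (codebooks, frames)
    with torch.no_grad():
        encoded = codec.encode(wav.unsqueeze(0))
    fine = torch.cat([f[0] for f in encoded], dim=-1).squeeze().cpu().numpy()

    # 3. Coarse prompt: the first two codebooks of the fine prompt
    coarse = fine[:2, :]

    # 4-5. Semantics: HuBERT features (no k-means), then this model's tokens
    hubert = CustomHubert(checkpoint_path="hubert_base_ls960.pt")
    tokenizer = CustomTokenizer.load_from_checkpoint("quantizer.pth")
    features = hubert.forward(wav, input_sample_hz=codec.sample_rate)
    semantics = tokenizer.get_token(features).cpu().numpy()

    # 6. Save the speaker file containing the cloned voice
    np.savez(out_path, semantic_prompt=semantics,
             fine_prompt=fine, coarse_prompt=coarse)
    return out_path
```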

## Voice masking

Voice masking is replacing a voice in an audio clip for speech-to-speech.

### Random

Replacing a voice in an audio clip with a voice generated by bark.

Process:

  1. Extract semantics from the audio clip using HuBERT and this model.
  2. Run `semantic_to_waveform` from `bark.api` with the extracted semantics.
  3. The previous step returns the generated audio.
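As a minimal sketch of the step above, assuming bark is installed and `semantics` is the token array produced by HuBERT plus this model:

```python
def mask_with_random_voice(semantics):
    """Sketch: re-voice extracted semantic tokens with a bark-generated
    voice. Assumes bark is installed."""
    from bark.api import semantic_to_waveform

    # Without a history_prompt, bark generates a voice on its own.
    return semantic_to_waveform(semantics)
```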

### Transfer

Replacing a voice with a voice from another audio clip.

Process:

  1. Create a speaker file using the steps under the voice cloning section.
  2. Extract the semantics from the clip containing the speech you want spoken.
  3. Run `semantic_to_waveform` from `bark.api` with the extracted semantics and the speaker prompt you created in step 1.
  4. The previous step returns the generated audio.
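The transfer steps can be sketched as below, assuming bark is installed and that bark accepts a path to an `.npz` speaker file as the history prompt (the speaker file being the one built in the voice cloning section):

```python
def transfer_voice(semantics, speaker_path="speaker.npz"):
    """Sketch: speak extracted semantic tokens with a cloned voice.
    Assumes bark is installed and speaker_path points to an .npz
    speaker file created as in the voice cloning section."""
    from bark.api import semantic_to_waveform

    # The history prompt carries the cloned voice into generation.
    return semantic_to_waveform(semantics, history_prompt=speaker_path)
```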

## Disclaimer

I am not responsible for any misuse of this model. I do not agree with cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so.