---
license: mit
pipeline_tag: feature-extraction
tags:
  - bark
  - tts
  - hubert
  - text-to-speech
datasets:
  - GitMylo/bark-semantic-training
---

# Bark-voice-cloning

Bark-voice-cloning is a model that processes the output of a HuBERT model and turns it into semantic tokens compatible with bark text-to-speech.

This can be used for many things, including speech transfer and voice cloning.

The training and inference code is available in the code repo.

## The models in this repo

(Please use the model manager from the code repo for easy downloading of models.)

## Voice cloning

Voice cloning is creating a new voice for text-to-speech.

Process:

  1. Load your wav audio file into your PyTorch application.
  2. For the fine prompt, extract discrete representations (these are the codes bark uses to reproduce the voice). Make sure to `.squeeze()` the resulting codes.
  3. For the coarse prompt, take `fine_prompt[:2, :]` to derive the coarse prompt from the fine prompt.
  4. For the semantics, load a HuBERT model without k-means (I personally use the audiolm-pytorch implementation's `HubertWithKmeans`, but edited to skip the k-means step).
  5. Next, run the HuBERT output through this model to get the actual semantic tokens. The output is compatible with bark.
  6. Save these arrays in an npz with `numpy.savez(file, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse)`. This is your speaker file containing your cloned voice.
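The steps above can be sketched roughly as follows. This is a hedged sketch, not the repo's exact code: it assumes bark, encodec, torchaudio, and the companion HuBERT quantizer code are installed, and the module layout (`hubert.pre_kmeans_hubert`, `hubert.customtokenizer`) and checkpoint filenames are illustrative.

```python
import numpy as np

def build_speaker_file(wav_path, out_path="speaker.npz"):
    """Sketch of the voice-cloning steps above. Assumes bark and the
    HuBERT quantizer from the code repo are installed; module and
    checkpoint names here are illustrative, not guaranteed."""
    import torch
    import torchaudio
    from bark.generation import load_codec_model
    from encodec.utils import convert_audio
    from hubert.pre_kmeans_hubert import CustomHubert    # assumed module layout
    from hubert.customtokenizer import CustomTokenizer   # from the code repo

    # 1. Load the wav and resample to what bark's codec expects
    wav, sr = torchaudio.load(wav_path)
    codec = load_codec_model(use_gpu=False)
    wav = convert_audio(wav, sr, codec.sample_rate, codec.channels)

    # 2. Fine prompt: discrete EnCodec codes, squeezed to (codebooks, frames)
    with torch.no_grad():
        encoded = codec.encode(wav.unsqueeze(0))
    fine = torch.cat([f[0] for f in encoded], dim=-1).squeeze().cpu().numpy()

    # 3. Coarse prompt: the first two codebooks of the fine prompt
    coarse = fine[:2, :]

    # 4-5. Semantics: HuBERT features (no k-means), then this model's tokens
    hubert = CustomHubert(checkpoint_path="hubert_base_ls960.pt")
    tokenizer = CustomTokenizer.load_from_checkpoint("quantizer.pth")
    features = hubert.forward(wav, input_sample_hz=codec.sample_rate)
    semantics = tokenizer.get_token(features).cpu().numpy()

    # 6. Save the speaker file containing the cloned voice
    np.savez(out_path, semantic_prompt=semantics,
             fine_prompt=fine, coarse_prompt=coarse)
    return out_path
```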

## Voice masking

Voice masking is replacing a voice in an audio clip for speech-to-speech.

### Random

Replacing a voice in an audio clip with a voice generated by bark.

Process:

  1. Extract semantics from the audio clip using HuBERT and this model.
  2. Run `semantic_to_waveform` from `bark.api` with the extracted semantics.
  3. The previous step returns the generated audio.
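As a minimal sketch of the step above, assuming bark is installed and `semantics` is the token array produced by HuBERT plus this model:

```python
def mask_with_random_voice(semantics):
    """Sketch: re-voice extracted semantic tokens with a bark-generated
    voice. Assumes bark is installed."""
    from bark.api import semantic_to_waveform

    # Without a history_prompt, bark generates a voice on its own.
    return semantic_to_waveform(semantics)
```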

### Transfer

Replacing a voice with a voice from another audio clip.

Process:

  1. Create a speaker file using the steps under the voice cloning section.
  2. Extract the semantics from the clip containing the speech you want spoken.
  3. Run `semantic_to_waveform` from `bark.api` with the extracted semantics and the speaker prompt you created in step 1.
  4. The previous step returns the generated audio.
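The transfer steps can be sketched as below, assuming bark is installed and that bark accepts a path to an `.npz` speaker file as the history prompt (the speaker file being the one built in the voice cloning section):

```python
def transfer_voice(semantics, speaker_path="speaker.npz"):
    """Sketch: speak extracted semantic tokens with a cloned voice.
    Assumes bark is installed and speaker_path points to an .npz
    speaker file created as in the voice cloning section."""
    from bark.api import semantic_to_waveform

    # The history prompt carries the cloned voice into generation.
    return semantic_to_waveform(semantics, history_prompt=speaker_path)
```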

## Disclaimer

I am not responsible for any misuse of this model. I do not agree with cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so.