---
license: mit
pipeline_tag: feature-extraction
tags:
- bark
- tts
- hubert
- text-to-speech
---
# Bark-voice-cloning
Bark-voice-cloning is a model that processes the outputs of a HuBERT model and turns them into semantic tokens compatible with Bark text-to-speech.

This can be used for many things, including speech transfer and voice cloning.
# Voice cloning
Voice cloning is creating a new voice for text-to-speech from a sample of an existing voice.

Process:
1. Load your WAV audio file into your PyTorch application.
2. For the fine prompt, extract [discrete representations](https://github.com/facebookresearch/encodec#extracting-discrete-representations) with EnCodec. (Bark uses these to reconstruct the voice.)
3. For the coarse prompt, take the first two codebooks of the fine prompt: `coarse_prompt = fine_prompt[:2, :]`.
4. For the semantics, run the audio through a HuBERT model without K-means quantization (I personally use the `HubertWithKmeans` implementation from [audiolm-pytorch](https://github.com/lucidrains/audiolm-pytorch), edited to skip the K-means step).
5. To get the actual semantic tokens, run the HuBERT output through this model. The output is compatible with Bark.
6. Save the three arrays in an npz with `numpy.savez(path, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse)`. This is your speaker file containing your cloned voice.
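Steps 3 and 6 above can be sketched with placeholder arrays. The shapes and token ranges below are illustrative assumptions standing in for real EnCodec and HuBERT outputs, not values prescribed by the model:

```python
import numpy as np

# Placeholder arrays standing in for real extractor outputs:
# - fine_prompt: EnCodec discrete codes, shape (n_codebooks, n_frames)
# - semantics: semantic tokens produced by this model from HuBERT features
fine_prompt = np.random.randint(0, 1024, size=(8, 512), dtype=np.int64)
semantics = np.random.randint(0, 10000, size=(256,), dtype=np.int64)

# Step 3: the coarse prompt is just the first two codebooks of the fine prompt.
coarse_prompt = fine_prompt[:2, :]

# Step 6: bundle everything into a speaker file.
np.savez(
    "speaker.npz",
    semantic_prompt=semantics,
    fine_prompt=fine_prompt,
    coarse_prompt=coarse_prompt,
)

# Sanity check: reload and confirm the expected keys and shapes.
speaker = np.load("speaker.npz")
print(sorted(speaker.files))           # ['coarse_prompt', 'fine_prompt', 'semantic_prompt']
print(speaker["coarse_prompt"].shape)  # (2, 512)
```

The resulting `speaker.npz` has the same three keys as the built-in Bark speaker files, so it can be passed anywhere Bark accepts a history prompt.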
# Voice masking
Voice masking is replacing a voice in an audio clip for speech-to-speech.

## Random
Replacing a voice in an audio clip with a voice generated by Bark.

Process:
1. Extract semantic tokens from the audio clip using HuBERT and this model.
2. Run `semantic_to_waveform` from `bark.api` with the extracted semantics.
3. The previous step returns the generated audio.
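A minimal sketch of the random-masking flow, assuming the `bark` package is installed and that `semantics` is the token array produced by steps 4-5 of the voice-cloning section. The import is deferred into the function so the sketch can be read without Bark installed:

```python
import numpy as np

def replace_voice_random(semantics: np.ndarray) -> np.ndarray:
    """Resynthesize semantic tokens with a random Bark-generated voice.

    `semantics` is assumed to be the semantic-token array obtained by
    running HuBERT output through this model. Requires the `bark`
    package at call time.
    """
    from bark.api import semantic_to_waveform

    # With no history prompt, Bark picks the voice itself, which is
    # exactly what "random" masking wants.
    return semantic_to_waveform(semantics)
```

Because no history prompt is supplied, each call may produce a different voice for the same input tokens.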
## Transfer
Replacing a voice with a voice from another audio clip.

Process:
1. Create a speaker file using the steps under the voice cloning section.
2. Extract the semantic tokens from the clip containing the speech you want re-voiced.
3. Run `semantic_to_waveform` from `bark.api` with the extracted semantics and the speaker file you created in step 1.
4. The previous step returns the generated audio.
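The transfer flow differs from random masking only in passing the speaker file along. This sketch assumes the `bark` package is installed and that `semantic_to_waveform` accepts the speaker `.npz` as its history prompt; the import is deferred so the sketch can be read without Bark installed:

```python
import numpy as np

def replace_voice_with_speaker(semantics: np.ndarray, speaker_path: str) -> np.ndarray:
    """Resynthesize semantic tokens in the voice stored in a speaker file.

    `semantics` comes from the clip whose words you want spoken;
    `speaker_path` points to the .npz speaker file created by the
    voice-cloning steps. Requires the `bark` package at call time.
    """
    from bark.api import semantic_to_waveform

    # Supplying the speaker file as the history prompt makes Bark
    # render the tokens in the cloned voice instead of a random one.
    return semantic_to_waveform(semantics, history_prompt=speaker_path)
```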
46 |
+
# Disclaimer
|
47 |
+
I am not responsible for any misuse of this model. I do not agree with cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so.
|