|
--- |
|
license: mit |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- bark |
|
- tts |
|
- hubert |
|
- text-to-speech |
|
--- |
|
# Bark-voice-cloning |
|
Bark-voice-cloning is a model that processes the outputs of a HuBERT model and turns them into semantic tokens compatible with [Bark](https://github.com/suno-ai/bark) text-to-speech.
|
|
|
This enables, among other uses, voice cloning and speech-to-speech voice transfer.
|
|
|
# Voice cloning |
|
Voice cloning creates a reusable speaker voice for text-to-speech from a reference audio clip.
|
|
|
Process (an end-to-end sketch follows the list):
|
1. Load your WAV audio file into your PyTorch application.
|
2. For the fine prompt, extract [discrete representations](https://github.com/facebookresearch/encodec#extracting-discrete-representations) with EnCodec. (Bark uses these codes to reproduce the characteristics of the voice.)
|
3. For the coarse prompt, take the first two codebooks of the fine prompt: `fine_prompt[:2, :]`.
|
4. For the semantics, load a HuBERT model without k-means. (I personally use the [audiolm-pytorch](https://github.com/lucidrains/audiolm-pytorch) implementation's `HubertWithKmeans`, edited to skip the k-means step.)
|
5. To get the actual semantic tokens, run the HuBERT features through this model. The output is compatible with Bark.
|
6. Save the three arrays in an `.npz` file with `numpy.savez(file, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse)`. This is your speaker file containing the cloned voice.
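
Putting the steps together, here is a minimal end-to-end sketch. The EnCodec calls follow the EnCodec README linked above; `load_hubert_without_kmeans` and `load_semantic_quantizer` are hypothetical placeholders for your edited audiolm-pytorch `HubertWithKmeans` and this model's checkpoint loader, since those loading details depend on your setup.

```python
import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 1. Load the reference WAV.
wav, sr = torchaudio.load("speaker.wav")

# 2. Fine prompt: discrete EnCodec codes (per the EnCodec README linked above).
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 8 codebooks, matching Bark's fine prompt
codec_wav = convert_audio(wav, sr, codec.sample_rate, codec.channels).unsqueeze(0)
with torch.no_grad():
    frames = codec.encode(codec_wav)
fine_prompt = torch.cat([f[0] for f in frames], dim=-1).squeeze(0).cpu().numpy()  # (8, T)

# 3. Coarse prompt: the first two codebooks of the fine prompt.
coarse_prompt = fine_prompt[:2, :]

# 4.-5. Semantic prompt. Both names below are hypothetical placeholders:
# swap in your edited audiolm-pytorch HubertWithKmeans (k-means skipped)
# and this repo's checkpoint loader.
hubert = load_hubert_without_kmeans()             # placeholder, see step 4
features = hubert(wav)                            # continuous HuBERT features
quantizer = load_semantic_quantizer("model.pth")  # placeholder for this model
semantic_prompt = quantizer(features).cpu().numpy()

# 6. Save the speaker file Bark can consume.
np.savez(
    "speaker.npz",
    semantic_prompt=semantic_prompt,
    fine_prompt=fine_prompt,
    coarse_prompt=coarse_prompt,
)
```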
|
|
|
# Voice masking |
|
Voice masking replaces the voice in an audio clip while keeping what is said (speech-to-speech).
|
|
|
## Random |
|
Replacing the voice in an audio clip with a random voice generated by Bark.
|
|
|
Process (a short sketch follows the list):
|
1. Extract semantics from the audio clip using HuBERT and this model |
|
2. Run `semantic_to_waveform` from `bark.api` with the extracted semantics.

3. The call returns the generated audio, spoken in a randomly generated voice.
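
A minimal sketch of these steps, assuming a hypothetical `extract_semantics(path)` helper that wraps the HuBERT-plus-this-model steps from the voice cloning section and returns a NumPy array of semantic tokens:

```python
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE
from bark.api import semantic_to_waveform

# Hypothetical helper: HuBERT features -> this model's semantic tokens
# (steps 4-5 of the voice cloning section), as a NumPy array.
semantics = extract_semantics("input.wav")

# With no history prompt, Bark renders the semantics in a random voice.
audio = semantic_to_waveform(semantics)
write_wav("masked.wav", SAMPLE_RATE, audio)
```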
|
|
|
## Transfer |
|
Replacing a voice with a voice from another audio clip. |
|
|
|
Process (a short sketch follows the list):
|
1. Create a speaker file using the steps under the voice cloning section |
|
2. Extract the semantics from the clip whose speech you want spoken in the new voice
|
3. Run `semantic_to_waveform` from `bark.api` with the extracted semantics and the speaker prompt you created in step 1.

4. The call returns the generated audio, spoken in the cloned voice.
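
A sketch of the transfer flow, reusing the hypothetical `extract_semantics` helper from the previous sketch. Passing the `.npz` path as `history_prompt` is an assumption that holds for recent Bark versions; older versions may require loading the file yourself.

```python
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE
from bark.api import semantic_to_waveform

# Hypothetical helper from the previous sketch: semantic tokens for the
# clip whose words should be spoken in the cloned voice.
semantics = extract_semantics("source_speech.wav")

# history_prompt is the speaker file from step 1 (assumption: your Bark
# version accepts an .npz path here).
audio = semantic_to_waveform(semantics, history_prompt="speaker.npz")
write_wav("transferred.wav", SAMPLE_RATE, audio)
```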
|
|
|
# Disclaimer |
|
I am not responsible for any misuse of this model. I do not agree with cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so. |