## Backbones quick introduction


### unett.py
- flat U-Net Transformer
- same structure as in the E2 TTS and Voicebox papers, except that it uses rotary position embeddings
- update: optionally supports an absolute position embedding and ConvNeXt V2 blocks for the embedded text before concatenation
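
As a rough illustration of what "flat U-Net" wiring means, the sketch below pairs each first-half layer with a second-half layer through a stack of skips, with no change in sequence resolution. It is not the code in `unett.py`: `flat_unet_forward`, the toy `layers`, and the additive `combine` are hypothetical stand-ins, and the sketch assumes the skip is taken from each first-half layer's output and that the depth is even.

```python
def flat_unet_forward(x, layers, combine):
    """Run `layers` in order with U-Net-style skips and no down/upsampling."""
    depth = len(layers)
    assert depth % 2 == 0, "this sketch assumes an even number of layers"
    skips = []  # stack of (layer_index, output) from the first half
    pairs = []  # (producer, consumer) indices, recorded only for illustration
    for idx, layer in enumerate(layers):
        if idx >= depth // 2:
            # Second half: pop the most recent skip and merge it into the input.
            src, skip = skips.pop()
            x = combine(x, skip)
            pairs.append((src, idx))
        x = layer(x)
        if idx < depth // 2:
            # First half: remember this layer's output for its mirror layer.
            skips.append((idx, x))
    return x, pairs


# Toy layers on a scalar "activation": layer k just adds k.
layers = [lambda v, k=k: v + k for k in range(4)]
out, pairs = flat_unet_forward(0.0, layers, combine=lambda a, b: a + b)
print(pairs)  # layer i feeds layer depth - 1 - i
```

In a real model `combine` would typically concatenate the two tensors along the feature axis and project back down with a linear layer; plain addition is used here only to keep the example dependency-free.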

### dit.py
- DiT with adaLN-Zero
- embedded timestep as the condition
- noised_input, masked_cond and embedded_text are concatenated, then passed through a linear input projection
- optional absolute position embedding and ConvNeXt V2 blocks for the embedded text before concatenation
- optional long skip connection (first layer to last layer)
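
The adaLN-Zero conditioning named above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the code in `dit.py`: `layer_norm`, `adaln_zero_block`, and `w_mod` are hypothetical names, the block wraps a single sublayer (a full DiT block produces separate shift/scale/gate triples for attention and for the feed-forward), and vectors are plain Python lists.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance (no affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_block(x, cond, w_mod, sublayer):
    """Residual block whose norm is modulated by `cond` (e.g. a timestep embedding)."""
    d = len(x)
    # Linear map from the condition to 3*d modulation values.
    mod = [sum(w * c for w, c in zip(row, cond)) for row in w_mod]
    shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
    # Modulate the normalized input, run the sublayer, then gate the residual.
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    h = sublayer(h)
    return [xi + g * hi for xi, g, hi in zip(x, gate, h)]


x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]
# "Zero": the modulation weights start at zero, so the gate is zero as well.
zero_w = [[0.0] * len(cond) for _ in range(3 * len(x))]
out = adaln_zero_block(x, cond, zero_w, sublayer=lambda h: [2.0 * v for v in h])
print(out == x)  # the block is the identity at initialization
```

Starting every block as the identity is the point of the zero initialization: the condition only begins to influence the output once training moves `w_mod` away from zero.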

### mmdit.py
- SD3 structure
- timestep as the condition
- left stream: embedded text with an absolute position embedding applied
- right stream: masked_cond and noised_input concatenated, with the same convolutional position embedding as in unett
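
To make the two-stream idea concrete, here is a small sketch of SD3-style joint attention: the two token sequences are joined along the sequence axis, attended over together, and split back into their streams. It is an illustration under simplifying assumptions, not the code in `mmdit.py`: `joint_attention` is a hypothetical helper, the per-stream query/key/value projections are replaced by the identity, and there is a single head.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def joint_attention(text_tokens, audio_tokens):
    """Attend over both streams at once, then split the result per stream."""
    # A real block would first apply separate q/k/v projections to each stream.
    joint = text_tokens + audio_tokens  # concatenate along the sequence axis
    mixed = attention(joint, joint, joint)
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:]


text = [[1.0, 0.0], [0.0, 1.0]]                 # left stream: 2 text tokens
audio = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]    # right stream: 3 audio frames
text_out, audio_out = joint_attention(text, audio)
print(len(text_out), len(audio_out))  # each stream keeps its own length
```

Because every token attends over the joined sequence, text and audio exchange information inside the attention step while the rest of each stream's layers can keep separate weights.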