1000 rip roaring steps of adamw8bit on a trajectory the likes of which have never been seen before:

SDXL_1.0_0.9_V-PREDICTION_SQCU_SNR

this is one for the history books

Model Description:

❀ developed by: @sameqcu xitter, @sqcu.bsky.social.
❀ model type: ||alpha(t)*eps - sigma(t)*x - v_hat_theta(z_t)||(2,2) mean squared error, variance preserving, k-sigmoid-loss-weighted denoising diffusion model.
❀ license: dm me if you actually think you want clearance to use this model for something. this is not a joke.
❀ model description: the computer takes a picture and it pretends the pixture is noise and then the compuiter predtendsn the image is turning into not onnoise and then it pretends it —
you briefly wake up from the dulled senses and insincere state of mind traditional for the 'model card reader'.
why are you reading a model card? in what possible sense can the computer program be described if its a tensor program?
the model describes itself, through action, or the only compact description of the model is the model itself.
go read baudrillard or smth. istg.
❀ resources for more information: fat-fingered public github repository extending popular 'gui frontend' diffusion model code to support training noise density function
∟❀ technical report Soon.
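for the curious: the v-prediction objective named in the 'model type' line above can be sketched in a few lines of plain python. this is a minimal illustration of the math, not the training code — shapes, batching, and the actual network are all elided.

```python
def v_target(x, eps, alpha_t, sigma_t):
    # v-prediction target: v = alpha(t) * eps - sigma(t) * x, elementwise
    return [alpha_t * e - sigma_t * xi for xi, e in zip(x, eps)]

def v_mse(v_hat, x, eps, alpha_t, sigma_t):
    # ||v - v_hat||_2^2 mean squared error against the v target
    v = v_target(x, eps, alpha_t, sigma_t)
    return sum((vi - vh) ** 2 for vi, vh in zip(v, v_hat)) / len(v)

# variance preserving: alpha(t)^2 + sigma(t)^2 = 1
alpha_t, sigma_t = 0.8, 0.6
x, eps = [1.0, -0.5], [0.25, 2.0]
# a perfect v prediction drives the loss to zero
perfect = v_mse(v_target(x, eps, alpha_t, sigma_t), x, eps, alpha_t, sigma_t)
```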

Ablation & Sous Rature:

figure 1: inference under training noise density function; β(0): 0.00085, β(T): 0.024, beta schedule shifted to zsnr.
figure 2: inference under training noise density function; β(0): 0.00085, β(T): 0.024, beta schedule not shifted to zsnr.
figure 3: inference under 'pretrain' noise density function; β(0): 0.00085, β(T): 0.012, beta schedule not shifted to zsnr.
figure 4: inference evaluating model outputs as 'epsilon predictions', 'pretrain' noise density function; β(0): 0.00085, β(T): 0.012, beta schedule not shifted to zsnr.
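for reference, the two noise schedules being compared can be built in a few lines of plain python — assuming, as the sd/sdxl default does, a 'scaled_linear' schedule (linear in sqrt(beta)); treat that shape as an assumption, not a quote from the training config.

```python
def scaled_linear_betas(beta_start, beta_end, n_steps):
    # 'scaled_linear' schedule: betas are linear in sqrt-space
    s0, s1 = beta_start ** 0.5, beta_end ** 0.5
    return [(s0 + (s1 - s0) * i / (n_steps - 1)) ** 2 for i in range(n_steps)]

# the two schedules the figures compare: training vs 'pretrain'
train_betas = scaled_linear_betas(0.00085, 0.024, 1000)
pretrain_betas = scaled_linear_betas(0.00085, 0.012, 1000)
```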

thru figures 1->4, we demonstrate that, irrespective of the aesthetic value or dataset adherence of the sampled outputs, the model is definitely uptrained to the 'v prediction' target.
these ablations furthermore illustrate the surprising mismatch in optimization difficulty of 'choosing prediction target' vs 'modeling images'.
if rectified flow prediction targets had a coherent philosophical or empirical justification for their use, perhaps we would have 'uptrained' to those targets as well.
however, understanding... elbo section d.3.3 argues that the rectified flow prediction target is v-prediction, not even 'v-prediction subject to a loss-weighting fn w(λ)'.
we agree.
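one quick way to see the identification (our algebra under the rectified-flow interpolant, hedged — not a quote from the paper): take $z_t = (1-t)x + t\epsilon$, whose conditional velocity target is $u = \epsilon - x$. with $\alpha_t = 1-t$, $\sigma_t = t$, the v-target is $v_t = \alpha_t\epsilon - \sigma_t x = (1-t)\epsilon - t x$, and

$$ v_t = (1 - 2t + 2t^2)\,u + (1 - 2t)\,z_t, $$

so given $z_t$, the two targets are related by an invertible affine map of the network output: regressing on $u$ and regressing on $v$ pin down the same predictions, and the only remaining freedom is the loss weighting $w(\lambda)$.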

Applications:

and why this model is truly a research topic rather than viral marketing for someone's desperate third+ raise round:

this is a denoising diffusion model uptrained to an alternate prediction target for 1000 steps on batchsize=4. this is not an inference model. do not under any circumstances sample from this denoising diffusion model's denoising predictions. that is a bad idea and i have no interest in supporting 'inference' of these released model weights.

instead, train on them. i recommend a modest optimizer stepsize of strictly less than 1e-05, adam β1,β2 of (0.9, 0.95) -> (0.95, 0.99). implement a sigmoid-k loss weighting as in understanding... elbo. we used k=5, as other weighting schemes reliably induce exploding gradients. the released model was trained in full bf16 precision, so if your training runs are blowups rather than glowups, you have made some substantial and serious configuration error, rather than discovered a stunning and surprising flaw in floating point precision.
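a sketch of one plausible form of that weighting — assuming 'sigmoid-k' means $w(\lambda) = \mathrm{sigmoid}(k - \lambda)$ on the logSNR $\lambda$, in the family of sigmoidal weightings from the kingma & gao elbo paper; treat the exact functional form as an assumption.

```python
import math

def sigmoid_k_weight(log_snr, k=5.0):
    # w(lambda) = sigmoid(k - lambda): near-full weight on noisy timesteps
    # (low logSNR), smoothly decaying weight on nearly-clean ones.
    # the 'sigmoid-k' name and k=5 come from the model card; the exact
    # functional form here is an assumption.
    return 1.0 / (1.0 + math.exp(log_snr - k))

def weighted_v_mse(v_mse_per_example, log_snr_per_example, k=5.0):
    # apply the weighting to an unweighted per-example v-prediction mse
    weighted = [sigmoid_k_weight(l, k) * m
                for l, m in zip(log_snr_per_example, v_mse_per_example)]
    return sum(weighted) / len(weighted)
```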
finally:
❀ you absolutely must train this model on the v prediction target!
∟ i will not explain this further!
∟ it is in diffusers!
❀ you absolutely must change the betas schedule for your model to match this pretrain's!
∟ beta_start: float = 0.00085,
∟ beta_end: float = 0.024,
❀ you absolutely must rescale the betas to zero terminal snr!
∟ this is also in diffusers' ddpm scheduler
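the zsnr rescale itself is short enough to sketch in plain python. this follows lin et al.'s 'common diffusion noise schedules and sample steps are flawed' recipe, which is the algorithm diffusers exposes via the ddpm scheduler's `rescale_betas_zero_snr` flag — a reimplementation for illustration, not the canonical code.

```python
def rescale_zero_terminal_snr(betas):
    # shift sqrt(alpha_bar) so the final step has exactly zero snr
    # (lin et al. 2023; illustrative reimplementation)
    sqrt_ab = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        sqrt_ab.append(prod ** 0.5)  # sqrt of cumulative product of alphas
    s0, sT = sqrt_ab[0], sqrt_ab[-1]
    # shift so the last entry is 0, then rescale so the first is unchanged
    sqrt_ab = [(s - sT) * s0 / (s0 - sT) for s in sqrt_ab]
    # convert back to betas
    ab = [s * s for s in sqrt_ab]
    alphas = [ab[0]] + [ab[i] / ab[i - 1] for i in range(1, len(ab))]
    return [1.0 - a for a in alphas]
```

after rescaling, the terminal beta is exactly 1 (zero signal left at t=T), while the first step's beta is unchanged.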

if you find a way to get 2024+-tech-level image generation training to converge using alternate optimizers, e.g. the adafactor family, adam-mini, or flora-opt, please get in touch, i'm really curious!

prohibited uses:

Model tree for SQCU/sd_xl_base_1.0_0.9_16bit_vpred_sqcusnr
