LTX-Video LoRA training study (Single image/style training)

Community Article Published January 14, 2025

Here's a study I made on LTX-Video lora training to better learn how training and inference settings affect the outcome. I hope it can be useful for others as well.

It's a rank 128 lora trained on single images only, using an old (actually my first) dataset made with SD 1.5. I chose it because I had it, it has a distinct style, and it is small in size. I used Gemini to (re)caption one of the images, then modified the prompt for the other images. All examples should be using the same seed. Sadly, I picked one that preferred to move backwards.

Training was done using diffusers with finetrainers as the backend and finetrainers-ui (my own project) as the GUI, on my 3090. All in all, it took around 3 hours.

Inference was done with ComfyUI core nodes, with this PR to allow for loading the loras.

I've added a few comments below. Let me know if there is more info you would like to see.

Example from dataset:

A graveyard at night. The scene is shrouded in a thick fog, creating a dark and eerie atmosphere. Numerous tombstones are visible, their inscriptions barely discernible in the dim light. Two large trees stand tall in the foreground, their branches reaching out like skeletal arms. The sky is overcast with a full moon casting a pale glow over the scene. The overall impression is one of mystery and melancholy.

Inspiration for the video prompt (not the same as above) came from here

1400 training steps

Lora strength variation, 50 inference steps, (0.55, 0.75, 0.9)

0.55 0.75 0.9
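As a rough sketch of what the strength value controls, here is the common LoRA merge formulation, where the low-rank update is scaled by `strength * alpha / rank` before being added to the base weight. This is an assumption about the usual convention, not a description of how the ComfyUI loader is implemented; `apply_lora`, `matmul`, and the toy matrices are hypothetical and only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_down, lora_up, alpha, rank, strength):
    """Return weight + strength * (alpha / rank) * (lora_up @ lora_down)."""
    scale = strength * alpha / rank
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 update; the numbers are made up.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_down = [[1.0, 2.0]]   # shape (rank, in_features)
lora_up = [[0.5], [-0.5]]  # shape (out_features, rank)

# Same three strengths as in the comparison above.
for strength in (0.55, 0.75, 0.9):
    merged = apply_lora(weight, lora_down, lora_up, alpha=1, rank=1, strength=strength)
    print(strength, merged)
```

Under this formulation, a strength of 0 leaves the base model untouched, and the lora's contribution grows linearly with the strength value.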

Frame variation, 50 inference steps, (73, 97, 153)

73 97 153

A lower number of frames decreases likeness. This was somewhat surprising, and unfortunately it means you can't quickly check out a lora with just a few frames.
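The three frame counts compared above (73, 97, 153) all have the form 8k + 1, which matches the frame-count shape LTX-Video is documented to expect. A minimal sketch for checking or snapping a frame count to that form follows; the helper names are hypothetical and not part of any library.

```python
def is_valid_frame_count(num_frames: int) -> bool:
    """True if num_frames has the form 8k + 1 (1, 9, 17, ...)."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def snap_frame_count(num_frames: int) -> int:
    """Round down to the nearest value of the form 8k + 1, with a minimum of 1."""
    return max(1, (num_frames - 1) // 8 * 8 + 1)


for frames in (73, 97, 153, 100):
    print(frames, is_valid_frame_count(frames), snap_frame_count(frames))
```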

Fps variation, 40 inference steps, (25, 45, 65)

25 45 65

I had seen that a higher inference fps could yield more movement when the lora was trained at lower than 24 fps. For single image training, it just seems to lessen movement.
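One possible way to read this, offered only as an interpretation: at a fixed frame count, a higher fps means the same frames span a shorter nominal time, so less motion fits into the clip. The arithmetic is sketched below. The frame count of 97 is an assumption, since the frame count used for this comparison is not stated, and `clip_duration_seconds` is a hypothetical helper.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Nominal duration of a clip with num_frames frames played at fps."""
    return num_frames / fps


ASSUMED_FRAMES = 97  # assumption: not stated for this comparison

for fps in (25, 45, 65):
    duration = clip_duration_seconds(ASSUMED_FRAMES, fps)
    print(f"{fps} fps -> {duration:.2f} s")
```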

2400 training steps

Lora strength variation, 50 inference steps, (0.55, 0.75, 0.9)

0.55 0.75 0.9

Lora strength variation, 60 inference steps, (0.55, 0.75, 0.9)

0.55 0.75 0.9

Cfg variation, 50 inference steps, 0.55 lora strength (cfg 2, cfg 3, cfg 4)

Cfg 2 Cfg 3 Cfg 4

High cfg adds creativity but lowers likeness (unsurprisingly).
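For reference, here is the standard classifier-free guidance combination that a cfg value usually refers to. This is a sketch of the textbook formula, assumed rather than taken from the sampler actually used here; `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(uncond, cond, cfg_scale):
    """Standard classifier-free guidance: uncond + cfg_scale * (cond - uncond)."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy two-element "predictions"; real ones are large tensors.
uncond = [0.0, 1.0]
cond = [1.0, 1.0]

for cfg in (2, 3, 4):
    print(cfg, cfg_combine(uncond, cond, cfg))
```

At a scale of 1 the result equals the conditional prediction; larger values push the prediction further along the direction from unconditional to conditional.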

97 frames comparison at different training steps (700, 1400, 2400)

700 1400 2400

For Image 2 Video, the lora doesn't do much. Results are comparable with or without the lora, given the same prompt and input image.

finetrainers config.yaml

accelerate_config: uncompiled_1.yaml
allow_tf32: true
batch_size: 28
beta1: 0.9
beta2: 0.95
caption_column: prompts.txt
caption_dropout_p: 0.05
caption_dropout_technique: empty
checkpointing_limit: 10
checkpointing_steps: 100
data_root:
dataloader_num_workers: 0
dataset_file: ''
diffusion_options: ''
enable_model_cpu_offload: ''
enable_slicing: true
enable_tiling: true
epsilon: 1e-8
gpu_ids: '0'
gradient_accumulation_steps: 1
gradient_checkpointing: true
id_token: afkx
image_resolution_buckets: 512x512
lora_alpha: 128
lr: 0.0002
lr_num_cycles: 1
lr_scheduler: linear
lr_warmup_steps: 100
max_grad_norm: 1
mixed_precision: bf16
model_name: ltx_video
nccl_timeout: 1800
num_validation_videos: 0
optimizer: adamw
output_dir: ''
pin_memory: true
precompute_conditions: ''
pretrained_model_name_or_path: ''
rank: 128
report_to: none
resume_from_checkpoint: ''
seed: 42
target_modules: to_q to_k to_v to_out.0
text_encoder_2_dtype: bf16
text_encoder_3_dtype: bf16
text_encoder_dtype: bf16
tracker_name: finetrainers
train_steps: 3000
training_type: lora
use_8bit_bnb: ''
vae_dtype: bf16
validation_epochs: 0
validation_prompt_separator: ':::'
validation_prompts: ''
validation_steps: 100
video_column: videos.txt
video_resolution_buckets: 1x512x512
weight_decay: 0.001
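A few quantities can be derived from the config above. The sketch below copies the relevant values into a plain dict and computes them. `summarize` is a hypothetical helper, and the `lora_alpha / rank` scale is the usual LoRA convention, assumed here to apply to finetrainers as well.

```python
# Values copied from the config above.
config = {
    "batch_size": 28,
    "gradient_accumulation_steps": 1,
    "train_steps": 3000,
    "checkpointing_steps": 100,
    "rank": 128,
    "lora_alpha": 128,
}


def summarize(cfg):
    """Derive effective batch size, samples seen, LoRA scale and checkpoint steps."""
    effective_batch = cfg["batch_size"] * cfg["gradient_accumulation_steps"]
    return {
        "effective_batch_size": effective_batch,
        # Counts samples with repetition, not unique images.
        "samples_seen": effective_batch * cfg["train_steps"],
        # Assumed convention: the LoRA update is scaled by alpha / rank.
        "lora_scale": cfg["lora_alpha"] / cfg["rank"],
        "checkpoint_steps": list(
            range(cfg["checkpointing_steps"], cfg["train_steps"] + 1, cfg["checkpointing_steps"])
        ),
    }


summary = summarize(config)
print(summary["effective_batch_size"], summary["samples_seen"], summary["lora_scale"])

# The step counts compared in this study all fall on checkpoint boundaries.
print(all(step in summary["checkpoint_steps"] for step in (700, 1400, 2400)))
```

With `lora_alpha` equal to `rank`, the assumed scale is 1.0, so under that convention the inference-time strength is the only multiplier on the lora update.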