---
license: apache-2.0
base_model:
- genmo/mochi-1-preview
pipeline_tag: text-to-video
tags:
- infinite zoom
- art style
- mochi
- diffusion
widget:
- text: Human fingers pinching to zoom on an infinite zoom canvas, a detailed cityscape at night, zoom focuses on a can, all surface around it is made of liquid and objects swimming in it.
  output:
    url: samples/4_1800.mp4
- text: Human fingers pinching to zoom on an infinite zoom canvas, spaceship going through space.
  output:
    url: samples/5_2000.mp4
- text: Human fingers pinching to zoom on an infinite zoom canvas, orange cat in the middle of a canvas, looking upward.
  output:
    url: samples/6_2000.mp4
---

# Fine-Tuning Mochi Text-to-Video: InfiniteZoom-Mochi

This project fine-tunes the **Mochi Text-to-Video** model with LoRA (Low-Rank Adaptation) so that it generates videos in the **infinite zoom art style**.
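For readers new to LoRA, the idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the Mochi trainer's code: the helper names `matvec` and `lora_forward` and the tiny matrices are made up for the example. It shows the standard LoRA formulation, where a frozen weight `W` is augmented by a low-rank product `B A` scaled by `alpha / rank`.

```python
# Minimal LoRA forward pass in plain Python (illustrative only, not the Mochi code).

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, rank, alpha):
    """Return W x + (alpha / rank) * B (A x).

    weight: frozen base matrix, shape (d_out, d_in)
    lora_a: trainable down-projection, shape (rank, d_in)
    lora_b: trainable up-projection, shape (d_out, rank)
    """
    scale = alpha / rank
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = d_out = 2, rank = 1 (hypothetical values for illustration).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]
lora_b_zero = [[0.0], [0.0]]      # B is commonly initialised to zero
lora_b_trained = [[0.5], [-0.5]]  # pretend values after some training
x = [2.0, 3.0]

y_start = lora_forward(weight, lora_a, lora_b_zero, x, rank=1, alpha=1)
y_trained = lora_forward(weight, lora_a, lora_b_trained, x, rank=1, alpha=1)
print(y_start, y_trained)
```

With `B` at zero the adapted layer matches the base layer exactly, so training starts from the unmodified model. Under this formulation, a rank of 16 with an alpha of 16 gives a scale factor of 1.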

## Training Details

- **Model Base**: [genmo/mochi-1-preview](https://huggingface.co/genmo/mochi-1-preview)
- **Fine-Tuning Dataset**: 23 short video clips in the infinite zoom art style, with accompanying `.txt` descriptions
- **Training Hardware**: H100 GPU
- **Training Duration**: 2 hours

<Gallery />

## lora.yaml
```yaml
init_checkpoint_path: /weights/dit.safetensors
checkpoint_dir: /finetunes/my_mochi_lora
train_data_dir: /videos_prepared
attention_mode: sdpa
single_video_mode: false # Useful for debugging whether your model can learn a single video

# You only need this if you're using wandb
wandb:
  # project: mochi_1_lora
  # name: ${checkpoint_dir}
  # group: null

optimizer:
  lr: 2e-4
  weight_decay: 0.01

model:
  type: lora
  kwargs:
    # Apply LoRA to the QKV projection and the output projection of the attention block.
    qkv_proj_lora_rank: 16
    qkv_proj_lora_alpha: 16
    qkv_proj_lora_dropout: 0.
    out_proj_lora_rank: 16
    out_proj_lora_alpha: 16
    out_proj_lora_dropout: 0.

training:
  model_dtype: bf16
  warmup_steps: 200
  num_qkv_checkpoint: 48
  num_ff_checkpoint: 48
  num_post_attn_checkpoint: 48
  num_steps: 2000
  save_interval: 200
  caption_dropout: 0.1
  grad_clip: 0.0
  save_safetensors: true

# Used for generating samples during training to monitor progress ...
sample:
   interval: 200
   output_dir: ${checkpoint_dir}/samples
   decoder_path: /weights/decoder.safetensors
   prompts:
      - Human fingers pinching to zoom on an infinite zoom canvas, a vast desert landscape stretches into the horizon. At the center, a giant hourglass sits, its glass exterior glinting in the sunlight. The zoom begins within the hourglass, revealing cascading grains of sand, each grain transitioning into a crystalline snowflake, leading to a frozen tundra as the scene deepens further.
      - Human fingers pinching to zoom on an infinite zoom canvas, a colossal tree rises from a lush forest, its bark covered with intricate carvings of stories. The zoom focuses on one carving, which transforms into a vibrant painting of a village. Zooming further, the village reveals bustling streets, where a single doorway becomes the entry to a glowing cosmos.
      - Human fingers pinching to zoom on an infinite zoom canvas, a tranquil ocean surface reflects the twilight sky. The zoom begins within a whirlpool, diving into vibrant coral reefs teeming with marine life. A single pearl on the ocean floor becomes the focus, transitioning into a marble palace with intricate golden inlays as the zoom continues seamlessly.
      - Human fingers pinching to zoom on an infinite zoom canvas, a glowing campfire crackles in a dense, dark forest. The zoom begins in the heart of the fire, revealing swirling embers that transition into galaxies of stars. The zoom then centers on a lone star, which transforms into a lantern hanging in a cozy mountain cabin, seamlessly revealing new layers.
      - Human fingers pinching to zoom on an infinite zoom canvas, a detailed cityscape at night, illuminated by neon lights and bustling with activity. The zoom focuses on a lit billboard advertising a soda can, transitioning into the sparkling surface of the liquid. As the zoom deepens, microscopic bubbles transform into entire ecosystems of floating islands within the soda.
   seed: 12345
   kwargs:
     height: 480
     width: 848
     num_frames: 37
     num_inference_steps: 64
     sigma_schedule_python_code: "linear_quadratic_schedule(64, 0.025)"
     cfg_schedule_python_code: "[6.0] * 64"
```
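As a rough aid for reading the config, the sketch below works out what the interval settings imply. It assumes that checkpoints and samples are written at every multiple of the interval, up to and including `num_steps`; the actual trainer may behave differently at the first or last step. The helper `interval_steps` is hypothetical and the values are copied from `lora.yaml` above.

```python
# Values copied from lora.yaml above.
num_steps = 2000
save_interval = 200
sample_interval = 200
num_inference_steps = 64


def interval_steps(total_steps, interval):
    """Steps at which a periodic event fires.

    Assumption: the event fires at every multiple of `interval`,
    including the final step. The real trainer may differ.
    """
    return list(range(interval, total_steps + 1, interval))


checkpoint_steps = interval_steps(num_steps, save_interval)
sample_steps = interval_steps(num_steps, sample_interval)

# The expression stored in cfg_schedule_python_code, written out directly:
# one guidance scale per inference step, constant at 6.0.
cfg_schedule = [6.0] * 64

print(f"{len(checkpoint_steps)} checkpoints at steps {checkpoint_steps}")
print(f"{len(cfg_schedule)} guidance values for {num_inference_steps} inference steps")
```

Because the save and sample intervals are equal, under this assumption every saved checkpoint has a matching set of sample videos, which makes it easier to pick the checkpoint whose samples look best.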