Files changed (1)
  1. README.md +27 -9
README.md CHANGED
@@ -28,15 +28,16 @@ This is the model repository for Pyramid Flow, a training-efficient **Autoregres
28
 
29
  ## News
30
 
31
- * `2024.10.29` ⚡️⚡️⚡️ We release [training code](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#training) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with FLUX structure trained from scratch.
32
 
33
- > We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint and 384p video checkpoint. These checkpoints are trained with synthetic data from FLUX. We will release the 768p video checkpoint in a few days.
 
34
  * `2024.10.11` 🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
35
  * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
36
 
37
  ## Installation
38
 
39
- We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2, and we are actively working to support a wider range of versions.
40
 
41
  ```bash
42
  git clone https://github.com/jy0205/Pyramid-Flow
@@ -48,7 +49,7 @@ conda activate pyramid
48
  pip install -r requirements.txt
49
  ```
50
 
51
- Then, download the model from [Huggingface](https://huggingface.co/rain1011) (there are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) or [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3)). The miniFLUX models support 1024p image and 384p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second video at 24FPS, while the 768p checkpoint generates up to 10-second video at 24FPS.
52
 
53
  ```python
54
  from huggingface_hub import snapshot_download
@@ -74,8 +75,9 @@ model_dtype, torch_dtype = 'bf16', torch.bfloat16 # Use bf16 (not support fp16
74
 
75
  model = PyramidDiTForVideoGeneration(
76
  'PATH', # The downloaded checkpoint dir
 
77
  model_dtype,
78
- model_variant='diffusion_transformer_384p', # SD3 supports 'diffusion_transformer_768p'
79
  )
80
 
81
  model.vae.enable_tiling()
@@ -92,15 +94,23 @@ Then, you can try text-to-video generation on your own prompts:
92
  ```python
93
  prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"
94
 
95
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
96
  frames = model.generate(
97
  prompt=prompt,
98
  num_inference_steps=[20, 20, 20],
99
  video_num_inference_steps=[10, 10, 10],
100
- height=384,
101
- width=640,
102
  temp=16, # temp=16: 5s, temp=31: 10s
103
- guidance_scale=9.0, # The guidance for the first frame, set it to 7 for 384p variant
104
  video_guidance_scale=5.0, # The guidance for the other video latent
105
  output_type="pil",
106
  save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
@@ -112,7 +122,15 @@ export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
112
  As an autoregressive model, our model also supports (text conditioned) image-to-video generation:
113
 
114
  ```python
115
- image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((640, 384))
116
  prompt = "FPV flying over the Great Wall"
117
 
118
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
 
28
 
29
  ## News
30
 
31
+ * `2024.11.13` 🚀🚀🚀 We release the [768p miniFLUX checkpoint](https://huggingface.co/rain1011/pyramid-flow-miniflux) (up to 10s).
32
 
33
+ > We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint, 384p video checkpoint (up to 5s), and 768p video checkpoint (up to 10s). The new miniFLUX model shows great improvement in human structure and motion stability.
34
+ * `2024.10.29` ⚡️⚡️⚡️ We release [training code](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#training) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with FLUX structure trained from scratch.
35
  * `2024.10.11` 🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
36
  * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
37
 
38
  ## Installation
39
 
40
+ We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2 ([guide](https://pytorch.org/get-started/previous-versions/#v212)), and we are actively working to support a wider range of versions.
41
 
42
  ```bash
43
  git clone https://github.com/jy0205/Pyramid-Flow
 
49
  pip install -r requirements.txt
50
  ```
51
 
52
+ Then, download the model from [Hugging Face](https://huggingface.co/rain1011). There are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) and [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3). The miniFLUX models support 1024p image generation and 384p and 768p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, while the 768p checkpoint generates videos of up to 10 seconds at 24 FPS.
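The pairing between checkpoint variant, frame size, and clip length described in this README can be kept in one place with a small lookup table. The sketch below is only an illustration: `VariantSettings`, `VARIANTS`, and `settings_for` are hypothetical names that are not part of the Pyramid Flow codebase, and the values are simply copied from this README (640x384 with `temp=16` for 5 s, 1280x768 with `temp=31` for up to 10 s).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSettings:
    """Settings for one checkpoint variant (hypothetical helper, not a Pyramid Flow API)."""
    model_variant: str  # value passed as `model_variant` when loading the model
    width: int          # frame width in pixels
    height: int         # frame height in pixels
    max_temp: int       # largest `temp` value this README lists for the variant
    max_seconds: int    # clip length this README associates with that `temp`


# Values copied from this README; nothing here is read from the checkpoints.
VARIANTS = {
    "384p": VariantSettings("diffusion_transformer_384p", 640, 384, 16, 5),
    "768p": VariantSettings("diffusion_transformer_768p", 1280, 768, 31, 10),
}


def settings_for(resolution: str) -> VariantSettings:
    """Return the settings for '384p' or '768p'; raise ValueError otherwise."""
    try:
        return VARIANTS[resolution]
    except KeyError:
        expected = sorted(VARIANTS)
        raise ValueError(f"unknown resolution {resolution!r}; expected one of {expected}") from None


cfg = settings_for("768p")
print(cfg.model_variant, cfg.width, cfg.height, cfg.max_temp)
```

With such a table, `cfg.model_variant` could be passed when loading the model and `cfg.width`, `cfg.height`, and `cfg.max_temp` when generating, so the three values cannot drift apart.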
53
 
54
  ```python
55
  from huggingface_hub import snapshot_download
 
75
 
76
  model = PyramidDiTForVideoGeneration(
77
  'PATH', # The downloaded checkpoint dir
78
+ model_name="pyramid_flux",
79
  model_dtype=model_dtype, # passed by keyword, since a positional argument cannot follow model_name=...
80
+ model_variant='diffusion_transformer_768p',
81
  )
82
 
83
  model.vae.enable_tiling()
 
94
  ```python
95
  prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"
96
 
97
+ # used for 384p model variant
98
+ # width = 640
99
+ # height = 384
100
+
101
+ # used for 768p model variant
102
+ width = 1280
103
+ height = 768
104
+
105
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
106
  frames = model.generate(
107
  prompt=prompt,
108
  num_inference_steps=[20, 20, 20],
109
  video_num_inference_steps=[10, 10, 10],
110
+ height=height,
111
+ width=width,
112
  temp=16, # temp=16: 5s, temp=31: 10s
113
+ guidance_scale=7.0, # The guidance for the first frame, set it to 7 for 384p variant
114
  video_guidance_scale=5.0, # The guidance for the other video latent
115
  output_type="pil",
116
  save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
 
122
  As an autoregressive model, our model also supports (text conditioned) image-to-video generation:
123
 
124
  ```python
125
+ # used for 384p model variant
126
+ # width = 640
127
+ # height = 384
128
+
129
+ # used for 768p model variant
130
+ width = 1280
131
+ height = 768
132
+
133
+ image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((width, height))
134
  prompt = "FPV flying over the Great Wall"
135
 
136
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):