Update README.md #6
opened by feifeiobama

README.md CHANGED
@@ -28,15 +28,16 @@ This is the model repository for Pyramid Flow, a training-efficient **Autoregres

 ## News

-* `2024.
+* `2024.11.13` 🚀🚀🚀 We release the [768p miniFLUX checkpoint](https://huggingface.co/rain1011/pyramid-flow-miniflux) (up to 10s).

-> We have switched the model structure from SD3 to a mini FLUX to fix human structure issues, please try our 1024p image checkpoint
+> We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint, 384p video checkpoint (up to 5s) and 768p video checkpoint (up to 10s). The new miniFLUX model shows great improvement in human structure and motion stability.
+* `2024.10.29` ⚡️⚡️⚡️ We release [training code](https://github.com/jy0205/Pyramid-Flow?tab=readme-ov-file#training) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with FLUX structure trained from scratch.
 * `2024.10.11` 🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!
 * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.

 ## Installation

-We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2, and we are actively working to support a wider range of versions.
+We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2 ([guide](https://pytorch.org/get-started/previous-versions/#v212)), and we are actively working to support a wider range of versions.

 ```bash
 git clone https://github.com/jy0205/Pyramid-Flow
@@ -48,7 +49,7 @@ conda activate pyramid
 pip install -r requirements.txt
 ```

-Then, download the model from [Huggingface](https://huggingface.co/rain1011) (there are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) or [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3)). The miniFLUX models support 1024p image and
+Then, download the model from [Huggingface](https://huggingface.co/rain1011) (there are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) or [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3)). The miniFLUX models support 1024p image, 384p and 768p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second video at 24 FPS, while the 768p checkpoint generates up to 10-second video at 24 FPS.

 ```python
 from huggingface_hub import snapshot_download
@@ -74,8 +75,9 @@ model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (not support fp16

 model = PyramidDiTForVideoGeneration(
     'PATH',                                         # The downloaded checkpoint dir
     model_dtype,
-    model_variant='
+    model_name="pyramid_flux",
+    model_variant='diffusion_transformer_768p',
 )

 model.vae.enable_tiling()
@@ -92,15 +94,23 @@ Then, you can try text-to-video generation on your own prompts:
 ```python
 prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

+# used for 384p model variant
+# width = 640
+# height = 384
+
+# used for 768p model variant
+width = 1280
+height = 768
+
 with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
     frames = model.generate(
         prompt=prompt,
         num_inference_steps=[20, 20, 20],
         video_num_inference_steps=[10, 10, 10],
-        height=
-        width=
+        height=height,
+        width=width,
         temp=16,                        # temp=16: 5s, temp=31: 10s
-        guidance_scale=
+        guidance_scale=7.0,             # The guidance for the first frame; set it to 7 for the 384p variant
         video_guidance_scale=5.0,       # The guidance for the other video latent
         output_type="pil",
         save_memory=True,               # If you have enough GPU memory, set it to `False` to improve vae decoding speed
@@ -112,7 +122,15 @@ export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
 As an autoregressive model, our model also supports (text conditioned) image-to-video generation:

 ```python
-
+# used for 384p model variant
+# width = 640
+# height = 384
+
+# used for 768p model variant
+width = 1280
+height = 768
+
+image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((width, height))
 prompt = "FPV flying over the Great Wall"

 with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
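A practical note on the new arguments: the checkpoint variant selected via `model_variant` has to agree with the resolution, clip length and first-frame guidance later passed to `model.generate`. The sketch below pairs the values introduced in this diff; the 384p directory name `diffusion_transformer_384p` is an assumption by analogy with the 768p name shown above, and the diff only states a `guidance_scale` of 7 for the 384p case.

```python
# Sketch only: matching each checkpoint variant with the generation settings from this diff.
# "diffusion_transformer_384p" is an assumed directory name (only the 768p one appears above).
VARIANT_SETTINGS = {
    "diffusion_transformer_768p": {"width": 1280, "height": 768, "temp": 31},  # up to 10s at 24 FPS
    "diffusion_transformer_384p": {"width": 640, "height": 384, "temp": 16},   # up to 5s at 24 FPS
}
```

Whichever entry is used must match both the `model_variant` handed to `PyramidDiTForVideoGeneration` and the `width`/`height` (and `temp`) handed to `model.generate`.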
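For completeness, the snippets touched by this diff assemble into roughly the following end-to-end text-to-video flow. This is a sketch under stated assumptions, not the verbatim README: the two import paths marked below and the `local_dir` choice are guesses, any device-placement or offloading lines outside these hunks are omitted, and the `generate` arguments are the ones shown in the diff.

```python
import torch
from huggingface_hub import snapshot_download
from diffusers.utils import export_to_video             # assumed import path for export_to_video
from pyramid_dit import PyramidDiTForVideoGeneration    # assumed import path for the model class

# Download the miniFLUX checkpoint into a local directory (the path is an arbitrary choice).
model_path = "PATH"
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path)

# Instantiate the 768p variant, as in the updated README snippet.
model_dtype, torch_dtype = 'bf16', torch.bfloat16
model = PyramidDiTForVideoGeneration(
    model_path,
    model_dtype,
    model_name="pyramid_flux",
    model_variant='diffusion_transformer_768p',
)
model.vae.enable_tiling()
# NOTE: any device-placement / offloading steps in the full README fall outside these hunks
# and are omitted here.

width, height = 1280, 768  # 768p variant; the diff uses 640 x 384 for the 384p variant
prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=height,
        width=width,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=7.0,
        video_guidance_scale=5.0,
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```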