radames committed on
Commit
da1edac
·
1 Parent(s): 25718a3
Files changed (12)
  1. README.md +255 -12
  2. app.py +158 -0
  3. model.py +147 -0
  4. patch +169 -0
  5. pose_images/000.png +0 -0
  6. pose_images/001.png +0 -0
  7. pose_images/002.png +0 -0
  8. pose_images/003.png +0 -0
  9. pose_images/004.png +0 -0
  10. pose_images/005.png +0 -0
  11. requirements.txt +12 -0
  12. style.css +16 -0
README.md CHANGED
@@ -1,12 +1,255 @@
- ---
- title: Text2Human API
- emoji: 📈
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 3.0.22
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Text2Human - Official PyTorch Implementation
+
+ <!-- <img src="./doc_images/overview.jpg" width="96%" height="96%"> -->
+
+ This repository provides the official PyTorch implementation for the following paper:
+
+ **Text2Human: Text-Driven Controllable Human Image Generation**<br>
+ [Yuming Jiang](https://yumingj.github.io/), [Shuai Yang](https://williamyang1991.github.io/), [Haonan Qiu](http://haonanqiu.com/), [Wayne Wu](https://dblp.org/pid/50/8731.html), [Chen Change Loy](https://www.mmlab-ntu.com/person/ccloy/) and [Ziwei Liu](https://liuziwei7.github.io/)<br>
+ In ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2022.
+
+ From [MMLab@NTU](https://www.mmlab-ntu.com/index.html) affiliated with S-Lab, Nanyang Technological University and SenseTime Research.
+
+ <table>
+ <tr>
+     <td><img src="assets/1.png" width="100%"/></td>
+     <td><img src="assets/2.png" width="100%"/></td>
+     <td><img src="assets/3.png" width="100%"/></td>
+     <td><img src="assets/4.png" width="100%"/></td>
+ </tr>
+ <tr>
+     <td align='center' width='24%'>The lady wears a short-sleeve T-shirt with pure color pattern, and a short and denim skirt.</td>
+     <td align='center' width='24%'>The man wears a long and floral shirt, and long pants with the pure color pattern.</td>
+     <td align='center' width='24%'>A lady is wearing a sleeveless pure-color shirt and long jeans</td>
+     <td align='center' width='24%'>The man wears a short-sleeve T-shirt with the pure color pattern and a short pants with the pure color pattern.</td>
+ </tr>
+ </table>
+
+ [**[Project Page]**](https://yumingj.github.io/projects/Text2Human.html) | [**[Paper]**](https://arxiv.org/pdf/2205.15996.pdf) | [**[Dataset]**](https://github.com/yumingj/DeepFashion-MultiModal) | [**[Demo Video]**](https://youtu.be/yKh4VORA_E0)
+
+
+ ## Updates
+
+ - [05/2022] Paper and demo video are released.
+ - [05/2022] Code is released.
+ - [05/2022] This website is created.
+
+ ## Installation
+ **Clone this repo:**
+ ```bash
+ git clone https://github.com/yumingj/Text2Human.git
+ cd Text2Human
+ ```
+ **Dependencies:**
+
+ All dependencies for defining the environment are provided in `environment/text2human_env.yaml`.
+ We recommend using [Anaconda](https://docs.anaconda.com/anaconda/install/) to manage the Python environment:
+ ```bash
+ conda env create -f ./environment/text2human_env.yaml
+ conda activate text2human
+ conda install -c huggingface tokenizers=0.9.4
+ conda install -c huggingface transformers=4.0.0
+ conda install -c conda-forge sentence-transformers=2.0.0
+ ```
+
+ If that does not work, you may need to install the following packages manually:
+ - Python 3.6
+ - PyTorch 1.7.1
+ - CUDA 10.1
+ - [sentence-transformers](https://huggingface.co/sentence-transformers) 2.0.0
+ - [tokenizers](https://pypi.org/project/tokenizers/) 0.9.4
+ - [transformers](https://huggingface.co/docs/transformers/installation) 4.0.0
+
+ ## (1) Dataset Preparation
+
+ In this work, we contribute a large-scale high-quality dataset with rich multi-modal annotations named [DeepFashion-MultiModal](https://github.com/yumingj/DeepFashion-MultiModal) Dataset.
+ We pre-processed the raw annotations of the original dataset for the task of text-driven controllable human image generation. The pre-processing pipeline consists of the following steps:
+ - align the human body in the center of the images according to the human pose
+ - fuse the clothing color and clothing fabric annotations into one texture annotation
+ - clean the annotations and filter the images
+ - split the whole dataset into training and testing sets
+
+ You can download our processed dataset from this [Google Drive](https://drive.google.com/file/d/1KIoFfRZNQVn6RV_wTxG2wZmY8f2T_84B/view?usp=sharing). If you want to access the raw annotations, please refer to the [DeepFashion-MultiModal](https://github.com/yumingj/DeepFashion-MultiModal) Dataset.
+
+ After downloading the dataset, unzip the file and place its contents under the `./datasets` folder with the following structure:
+ ```
+ ./datasets
+ ├── train_images
+     ├── xxx.png
+     ...
+     ├── xxx.png
+     └── xxx.png
+ ├── test_images
+     % the same structure as in train_images
+ ├── densepose
+     % the same structure as in train_images
+ ├── segm
+     % the same structure as in train_images
+ ├── shape_ann
+     ├── test_ann_file.txt
+     ├── train_ann_file.txt
+     └── val_ann_file.txt
+ └── texture_ann
+     ├── test
+         ├── lower_fused.txt
+         ├── outer_fused.txt
+         └── upper_fused.txt
+     ├── train
+         % the same files as in test
+     └── val
+         % the same files as in test
+ ```
+
+ ## (2) Sampling
+
+ ### Inference Notebook
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" height=22.5><br>
+ Coming soon.
+
+
+ ### Pretrained Models
+
+ Pretrained models can be downloaded from this [Google Drive](https://drive.google.com/file/d/1VyI8_AbPwAUaZJPaPba8zxsFIWumlDen/view?usp=sharing). Unzip the file and arrange its contents with the following structure:
+ ```
+ pretrained_models
+ ├── index_pred_net.pth
+ ├── parsing_gen.pth
+ ├── parsing_token.pth
+ ├── sampler.pth
+ ├── vqvae_bottom.pth
+ └── vqvae_top.pth
+ ```
+
+ ### Generation from Parsing Maps
+ You can generate images from given parsing maps and pre-defined texture annotations:
+ ```bash
+ python sample_from_parsing.py -opt ./configs/sample_from_parsing.yml
+ ```
+ The results are saved in the folder `./results/sampling_from_parsing`.
+
+ ### Generation from Poses
+ You can generate images from given human poses and pre-defined clothing shape and texture annotations:
+ ```bash
+ python sample_from_pose.py -opt ./configs/sample_from_pose.yml
+ ```
+
+ **Remarks**: The above two scripts generate images without language interactions. To generate images from text, use the notebook or our user interface.
+
+ ### User Interface
+
+ ```bash
+ python ui_demo.py
+ ```
+ <img src="./assets/ui.png" width="100%">
+
+ Shape descriptions should use the following format:
+ ```
+ <gender>, <sleeve length>, <length of lower clothing>, <outer clothing type>, <other accessories1>, ...
+
+ Note: The outer clothing type and accessories can be omitted.
+
+ Examples:
+ man, sleeveless T-shirt, long pants
+ woman, short-sleeve T-shirt, short jeans
+ ```
+
+ Texture descriptions should use the following format:
+ ```
+ <upper clothing texture>, <lower clothing texture>, <outer clothing texture>
+
+ Note: Currently, we only support 5 types of textures, i.e., pure color, stripe/spline, plaid/lattice,
+ floral, denim. Your inputs should be restricted to these textures.
+ ```
+
+ ## (3) Training Text2Human
+
+ ### Stage I: Pose to Parsing
+ Train the parsing generation network. If you want to skip the training of this network, you can download our pretrained model from [here](https://drive.google.com/file/d/1MNyFLGqIQcOMg_HhgwCmKqdwfQSjeg_6/view?usp=sharing).
+ ```bash
+ python train_parsing_gen.py -opt ./configs/parsing_gen.yml
+ ```
+
+ ### Stage II: Parsing to Human
+
+ **Step 1: Train the top level of the hierarchical VQVAE.**
+ We provide our pretrained model [here](https://drive.google.com/file/d/1TwypUg85gPFJtMwBLUjVS66FKR3oaTz8/view?usp=sharing). This model is trained by:
+ ```bash
+ python train_vqvae.py -opt ./configs/vqvae_top.yml
+ ```
+
+ **Step 2: Train the bottom level of the hierarchical VQVAE.**
+ We provide our pretrained model [here](https://drive.google.com/file/d/15hzbY-RG-ILgzUqqGC0qMzlS4OayPdRH/view?usp=sharing). This model is trained by:
+ ```bash
+ python train_vqvae.py -opt ./configs/vqvae_bottom.yml
+ ```
+
+ **Step 3 & 4: Train the sampler with mixture-of-experts.** To train the sampler, we first need to train a model to tokenize the parsing maps. You can access our pretrained parsing tokenization model [here](https://drive.google.com/file/d/1GLHoOeCP6sMao1-R63ahJMJF7-J00uir/view?usp=sharing).
+ ```bash
+ python train_parsing_token.py -opt ./configs/parsing_token.yml
+ ```
+
+ With the parsing tokenization model, the sampler is trained by:
+ ```bash
+ python train_sampler.py -opt ./configs/sampler.yml
+ ```
+ Our pretrained sampler is provided [here](https://drive.google.com/file/d/1OQO_kG2fK7eKiG1VJH1OL782X71UQAmS/view?usp=sharing).
+
+ **Step 5: Train the index prediction network.**
+ We provide our pretrained index prediction network [here](https://drive.google.com/file/d/1rqhkQD-JGd7YBeIfDvMV-vjfbNHpIhYm/view?usp=sharing). It is trained by:
+ ```bash
+ python train_index_prediction.py -opt ./configs/index_pred_net.yml
+ ```
+
+
+ **Remarks**: In the config files, the pretrained-model paths point to our models. If you want to train the models from scratch, replace these paths with your own. We set the number of training epochs to large values so that you can choose the best epoch for each model. For your reference, our pretrained parsing generation network was trained for 50 epochs, the top-level VQVAE for 135 epochs, the bottom-level VQVAE for 70 epochs, the parsing tokenization network for 20 epochs, the sampler for 95 epochs, and the index prediction network for 70 epochs.
+
+ ## (4) Results
+
+ Please visit our [Project Page](https://yumingj.github.io/projects/Text2Human.html#results) to view more results.<br>
+ You can select the attributes to customize the desired human images.
+ [<img src="./assets/results.png" width="90%">
+ ](https://yumingj.github.io/projects/Text2Human.html#results)
+
+ ## DeepFashion-MultiModal Dataset
+
+ <img src="./assets/dataset_logo.png" width="90%">
+
+ In this work, we also propose **DeepFashion-MultiModal**, a large-scale high-quality human dataset with rich multi-modal annotations. It has the following properties:
+ 1. It contains 44,096 high-resolution human images, including 12,701 full-body human images.
+ 2. For each full-body image, we **manually annotate** the human parsing labels of 24 classes.
+ 3. For each full-body image, we **manually annotate** the keypoints.
+ 4. We extract DensePose for each human image.
+ 5. Each image is **manually annotated** with attributes for both clothes shapes and textures.
+ 6. We provide a textual description for each image.
+
+ <img src="./assets/dataset_overview.png" width="100%">
+
+ Please refer to [this repo](https://github.com/yumingj/DeepFashion-MultiModal) for more details about our proposed dataset.
+
+ ## TODO List
+
+ - [ ] Release 1024x512 version of Text2Human.
+ - [ ] Train Text2Human using the [SHHQ dataset](https://stylegan-human.github.io/).
+
+ ## Citation
+
+ If you find this work useful for your research, please consider citing our paper:
+
+ ```bibtex
+ @article{jiang2022text2human,
+   title={Text2Human: Text-Driven Controllable Human Image Generation},
+   author={Jiang, Yuming and Yang, Shuai and Qiu, Haonan and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
+   journal={ACM Transactions on Graphics (TOG)},
+   volume={41},
+   number={4},
+   articleno={162},
+   pages={1--11},
+   year={2022},
+   publisher={ACM New York, NY, USA},
+   doi={10.1145/3528223.3530104},
+ }
+ ```
+
+ ## Acknowledgments
+
+ Part of the code is borrowed from [unleashing-transformers](https://github.com/samb-t/unleashing-transformers), [taming-transformers](https://github.com/CompVis/taming-transformers) and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation).
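Note on the texture format described in the README: the description is a comma-separated list restricted to five texture families. A minimal sketch of a pre-check for such input could look like the following. `validate_texture_description` and `SUPPORTED_TEXTURES` are hypothetical names introduced here for illustration; they are not part of Text2Human, and simple keyword matching is an assumption, not how the repository's `generate_texture_attributes` interprets text.

```python
from __future__ import annotations

# Hypothetical lookup table: the five texture families named in the README,
# each with the keywords assumed to identify it.
SUPPORTED_TEXTURES = {
    "pure color": ("pure color",),
    "stripe/spline": ("stripe", "spline"),
    "plaid/lattice": ("plaid", "lattice"),
    "floral": ("floral",),
    "denim": ("denim",),
}


def validate_texture_description(text: str) -> list[str]:
    """Map each entry of '<upper>, <lower>, <outer>' to a texture family.

    Raises ValueError when an entry mentions none of the supported textures.
    """
    entries = [part.strip().lower() for part in text.split(",") if part.strip()]
    if not 1 <= len(entries) <= 3:
        raise ValueError("expected 1 to 3 comma-separated textures")
    families = []
    for entry in entries:
        for family, keywords in SUPPORTED_TEXTURES.items():
            if any(keyword in entry for keyword in keywords):
                families.append(family)
                break
        else:
            # No family matched this entry.
            raise ValueError(f"unsupported texture: {entry!r}")
    return families
```

Calling `validate_texture_description("pure color, denim")` returns `["pure color", "denim"]`, while an entry such as `"leather"` raises `ValueError` before any model code runs.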
app.py ADDED
@@ -0,0 +1,158 @@
+ #!/usr/bin/env python
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import pathlib
+ import subprocess
+
+ import gradio as gr
+
+ if os.getenv('SYSTEM') == 'spaces':
+     import mim
+
+     mim.uninstall('mmcv-full', confirm_yes=True)
+     mim.install('mmcv-full==1.5.2', is_yes=True)
+
+     with open('patch') as f:
+         subprocess.run('patch -p1'.split(), cwd='Text2Human', stdin=f)
+
+ from model import Model
+
+ DESCRIPTION = '''# Text2Human
+
+ This is an unofficial demo for <a href="https://github.com/yumingj/Text2Human">https://github.com/yumingj/Text2Human</a> made by <a href="https://huggingface.co/spaces/hysts/Text2Human">@hysts</a>.
+ You can modify sample steps and seeds. By varying seeds, you can sample different human images under the same pose, shape description, and texture description. The larger the sample steps, the better quality of the generated images. (The default value of sample steps is 256 in the original repo.)
+
+ Label image generation step can be skipped. However, in that case, the input label image must be 512x256 in size and must contain only the specified colors.
+ '''
+ FOOTER = '<img id="visitor-badge" alt="visitor badge" src="https://visitor-badge.glitch.me/badge?page_id=hysts.text2human" />'
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--device', type=str, default='cpu')
+     parser.add_argument('--theme', type=str)
+     parser.add_argument('--share', action='store_true')
+     parser.add_argument('--port', type=int)
+     parser.add_argument('--disable-queue',
+                         dest='enable_queue',
+                         action='store_false')
+     return parser.parse_args()
+
+
+ def set_example_image(example: list) -> dict:
+     return gr.Image.update(value=example[0])
+
+
+ def set_example_text(example: list) -> dict:
+     return gr.Textbox.update(value=example[0])
+
+
+ def main():
+     args = parse_args()
+     model = Model(args.device)
+
+     with gr.Blocks(theme=args.theme, css='style.css') as demo:
+         gr.Markdown(DESCRIPTION)
+
+         with gr.Row():
+             with gr.Column():
+                 with gr.Row():
+                     input_image = gr.Image(label='Input Pose Image',
+                                            type='pil',
+                                            elem_id='input-image')
+                     pose_data = gr.Variable()
+                 with gr.Row():
+                     paths = sorted(pathlib.Path('pose_images').glob('*.png'))
+                     example_images = gr.Dataset(components=[input_image],
+                                                 samples=[[path.as_posix()]
+                                                          for path in paths])
+
+                 with gr.Row():
+                     shape_text = gr.Textbox(
+                         label='Shape Description',
+                         placeholder=
+                         '''<gender>, <sleeve length>, <length of lower clothing>, <outer clothing type>, <other accessories1>, ...
+ Note: The outer clothing type and accessories can be omitted.''')
+                 with gr.Row():
+                     shape_example_texts = gr.Dataset(
+                         components=[shape_text],
+                         samples=[['man, sleeveless T-shirt, long pants'],
+                                  ['woman, short-sleeve T-shirt, short jeans']])
+                 with gr.Row():
+                     generate_label_button = gr.Button('Generate Label Image')
+
+             with gr.Column():
+                 with gr.Row():
+                     label_image = gr.Image(label='Label Image',
+                                            type='numpy',
+                                            elem_id='label-image')
+
+                 with gr.Row():
+                     texture_text = gr.Textbox(
+                         label='Texture Description',
+                         placeholder=
+                         '''<upper clothing texture>, <lower clothing texture>, <outer clothing texture>
+ Note: Currently, only 5 types of textures are supported, i.e., pure color, stripe/spline, plaid/lattice, floral, denim.'''
+                     )
+                 with gr.Row():
+                     texture_example_texts = gr.Dataset(
+                         components=[texture_text],
+                         samples=[['pure color, denim'], ['floral, stripe']])
+                 with gr.Row():
+                     sample_steps = gr.Slider(10,
+                                              300,
+                                              value=10,
+                                              step=10,
+                                              label='Sample Steps')
+                 with gr.Row():
+                     seed = gr.Slider(0, 1000000, value=0, step=1, label='Seed')
+                 with gr.Row():
+                     generate_human_button = gr.Button('Generate Human')
+
+             with gr.Column():
+                 with gr.Row():
+                     result = gr.Image(label='Result',
+                                       type='numpy',
+                                       elem_id='result-image')
+
+         gr.Markdown(FOOTER)
+
+         input_image.change(fn=model.process_pose_image,
+                            inputs=input_image,
+                            outputs=pose_data)
+         generate_label_button.click(fn=model.generate_label_image,
+                                     inputs=[
+                                         pose_data,
+                                         shape_text,
+                                     ],
+                                     outputs=label_image)
+         generate_human_button.click(fn=model.generate_human,
+                                     inputs=[
+                                         label_image,
+                                         texture_text,
+                                         sample_steps,
+                                         seed,
+                                     ],
+                                     outputs=result)
+         example_images.click(fn=set_example_image,
+                              inputs=example_images,
+                              outputs=example_images.components)
+         shape_example_texts.click(fn=set_example_text,
+                                   inputs=shape_example_texts,
+                                   outputs=shape_example_texts.components)
+         texture_example_texts.click(fn=set_example_text,
+                                     inputs=texture_example_texts,
+                                     outputs=texture_example_texts.components)
+
+     demo.launch(
+         enable_queue=args.enable_queue,
+         server_port=args.port,
+         share=args.share,
+     )
+
+
+ if __name__ == '__main__':
+     main()
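Note on `parse_args` in `app.py`: the `--disable-queue` flag is declared with `dest='enable_queue'` and `action='store_false'`, so the queue is on by default and the flag turns it off. The sketch below isolates that pattern with only the standard library; `build_parser` is a hypothetical name used here, and only two of the demo's arguments are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Reduced copy of the demo's CLI, kept to the two flags discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--device', type=str, default='cpu')
    # store_false defaults to True, so args.enable_queue is True unless
    # --disable-queue is passed on the command line.
    parser.add_argument('--disable-queue',
                        dest='enable_queue',
                        action='store_false')
    return parser


# No flags: the queue stays enabled and the device falls back to CPU.
default_args = build_parser().parse_args([])

# Explicit flags: the queue is disabled and the device string is passed through.
flagged_args = build_parser().parse_args(['--disable-queue', '--device', 'cuda:0'])
```

Naming the flag after the non-default behaviour keeps the command line short for the common case while the code still reads a positively named attribute.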
model.py ADDED
@@ -0,0 +1,147 @@
+ from __future__ import annotations
+
+ import os
+ import pathlib
+ import sys
+ import zipfile
+
+ import huggingface_hub
+ import numpy as np
+ import PIL.Image
+ import torch
+
+ sys.path.insert(0, 'Text2Human')
+
+ from models.sample_model import SampleFromPoseModel
+ from utils.language_utils import (generate_shape_attributes,
+                                   generate_texture_attributes)
+ from utils.options import dict_to_nonedict, parse
+ from utils.util import set_random_seed
+
+ COLOR_LIST = [
+     (0, 0, 0),
+     (255, 250, 250),
+     (220, 220, 220),
+     (250, 235, 215),
+     (255, 250, 205),
+     (211, 211, 211),
+     (70, 130, 180),
+     (127, 255, 212),
+     (0, 100, 0),
+     (50, 205, 50),
+     (255, 255, 0),
+     (245, 222, 179),
+     (255, 140, 0),
+     (255, 0, 0),
+     (16, 78, 139),
+     (144, 238, 144),
+     (50, 205, 174),
+     (50, 155, 250),
+     (160, 140, 88),
+     (213, 140, 88),
+     (90, 140, 90),
+     (185, 210, 205),
+     (130, 165, 180),
+     (225, 141, 151),
+ ]
+
+
+ class Model:
+     def __init__(self, device: str):
+         self.config = self._load_config()
+         self.config['device'] = device
+         self._download_models()
+         self.model = SampleFromPoseModel(self.config)
+         self.model.batch_size = 1
+
+     def _load_config(self) -> dict:
+         path = 'Text2Human/configs/sample_from_pose.yml'
+         config = parse(path, is_train=False)
+         config = dict_to_nonedict(config)
+         return config
+
+     def _download_models(self) -> None:
+         model_dir = pathlib.Path('pretrained_models')
+         if model_dir.exists():
+             return
+         token = os.getenv('HF_TOKEN')
+         path = huggingface_hub.hf_hub_download('yumingj/Text2Human',
+                                                'pretrained_models.zip',
+                                                use_auth_token=token)
+         model_dir.mkdir()
+         with zipfile.ZipFile(path) as f:
+             f.extractall(model_dir)
+
+     @staticmethod
+     def preprocess_pose_image(image: PIL.Image.Image) -> torch.Tensor:
+         image = np.array(
+             image.resize(
+                 size=(256, 512),
+                 resample=PIL.Image.Resampling.LANCZOS))[:, :, 2:].transpose(
+                     2, 0, 1).astype(np.float32)
+         image = image / 12. - 1
+         data = torch.from_numpy(image).unsqueeze(1)
+         return data
+
+     @staticmethod
+     def process_mask(mask: np.ndarray) -> np.ndarray:
+         if mask.shape != (512, 256, 3):
+             return None
+         seg_map = np.full(mask.shape[:-1], -1)
+         for index, color in enumerate(COLOR_LIST):
+             seg_map[np.sum(mask == color, axis=2) == 3] = index
+         if not (seg_map != -1).all():
+             return None
+         return seg_map
+
+     @staticmethod
+     def postprocess(result: torch.Tensor) -> np.ndarray:
+         result = result.permute(0, 2, 3, 1)
+         result = result.detach().cpu().numpy()
+         result = result * 255
+         result = np.asarray(result[0, :, :, :], dtype=np.uint8)
+         return result
+
+     def process_pose_image(self, pose_image: PIL.Image.Image) -> torch.Tensor:
+         if pose_image is None:
+             return
+         data = self.preprocess_pose_image(pose_image)
+         self.model.feed_pose_data(data)
+         return data
+
+     def generate_label_image(self, pose_data: torch.Tensor,
+                              shape_text: str) -> np.ndarray:
+         if pose_data is None:
+             return
+         self.model.feed_pose_data(pose_data)
+         shape_attributes = generate_shape_attributes(shape_text)
+         shape_attributes = torch.LongTensor(shape_attributes).unsqueeze(0)
+         self.model.feed_shape_attributes(shape_attributes)
+         self.model.generate_parsing_map()
+         self.model.generate_quantized_segm()
+         colored_segm = self.model.palette_result(self.model.segm[0].cpu())
+         return colored_segm
+
+     def generate_human(self, label_image: np.ndarray, texture_text: str,
+                        sample_steps: int, seed: int) -> np.ndarray:
+         if label_image is None:
+             return
+         mask = label_image.copy()
+         seg_map = self.process_mask(mask)
+         if seg_map is None:
+             return
+         self.model.segm = torch.from_numpy(seg_map).unsqueeze(0).unsqueeze(
+             0).to(self.model.device)
+         self.model.generate_quantized_segm()
+
+         set_random_seed(seed)
+
+         texture_attributes = generate_texture_attributes(texture_text)
+         texture_attributes = torch.LongTensor(texture_attributes)
+         self.model.feed_texture_attributes(texture_attributes)
+         self.model.generate_texture_map()
+
+         self.model.sample_steps = sample_steps
+         out = self.model.sample_and_refine()
+         res = self.postprocess(out)
+         return res
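Note on `Model.process_mask` in `model.py`: it converts an RGB label image into a map of class indices by matching every pixel against a fixed palette, and rejects the image if any pixel is off-palette. The sketch below shows the same idea on a 2x2 image. `colors_to_indices` and the three-colour `PALETTE` are hypothetical names and values chosen for illustration (the demo's `COLOR_LIST` has 24 entries), and the fixed 512x256 shape check is left out.

```python
import numpy as np

# Toy palette, assumed for illustration only: index 0, 1, 2.
PALETTE = [(0, 0, 0), (255, 0, 0), (0, 100, 0)]


def colors_to_indices(mask, palette):
    """Return an (H, W) index map, or None if any pixel is off-palette."""
    # Start with -1 everywhere so unmatched pixels can be detected afterwards.
    seg_map = np.full(mask.shape[:-1], -1)
    for index, color in enumerate(palette):
        # A pixel matches only when all three channels equal the palette colour.
        seg_map[(mask == color).all(axis=2)] = index
    if (seg_map == -1).any():
        return None
    return seg_map


mask = np.array([[(0, 0, 0), (255, 0, 0)],
                 [(0, 100, 0), (255, 0, 0)]], dtype=np.uint8)
indices = colors_to_indices(mask, PALETTE)

# Changing one pixel to a colour outside the palette makes the mask invalid.
off_palette = mask.copy()
off_palette[0, 0] = (1, 2, 3)
rejected = colors_to_indices(off_palette, PALETTE)
```

This strict matching is why the demo description requires a hand-supplied label image to contain only the specified colours: interpolated or anti-aliased edge pixels would leave `-1` entries and the whole mask would be rejected.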
patch ADDED
@@ -0,0 +1,169 @@
+ diff --git a/models/hierarchy_inference_model.py b/models/hierarchy_inference_model.py
+ index 3116307..5de661d 100644
+ --- a/models/hierarchy_inference_model.py
+ +++ b/models/hierarchy_inference_model.py
+ @@ -21,7 +21,7 @@ class VQGANTextureAwareSpatialHierarchyInferenceModel():
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.is_train = opt['is_train']
+
+           self.top_encoder = Encoder(
+ diff --git a/models/hierarchy_vqgan_model.py b/models/hierarchy_vqgan_model.py
+ index 4b0d657..0bf4712 100644
+ --- a/models/hierarchy_vqgan_model.py
+ +++ b/models/hierarchy_vqgan_model.py
+ @@ -20,7 +20,7 @@ class HierarchyVQSpatialTextureAwareModel():
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.top_encoder = Encoder(
+               ch=opt['top_ch'],
+               num_res_blocks=opt['top_num_res_blocks'],
+ diff --git a/models/parsing_gen_model.py b/models/parsing_gen_model.py
+ index 9440345..15a1ecb 100644
+ --- a/models/parsing_gen_model.py
+ +++ b/models/parsing_gen_model.py
+ @@ -22,7 +22,7 @@ class ParsingGenModel():
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.is_train = opt['is_train']
+
+           self.attr_embedder = ShapeAttrEmbedding(
+ diff --git a/models/sample_model.py b/models/sample_model.py
+ index 4c60e3f..5265cd0 100644
+ --- a/models/sample_model.py
+ +++ b/models/sample_model.py
+ @@ -23,7 +23,7 @@ class BaseSampleModel():
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+
+           # hierarchical VQVAE
+           self.decoder = Decoder(
+ @@ -123,7 +123,7 @@ class BaseSampleModel():
+
+       def load_top_pretrain_models(self):
+           # load pretrained vqgan
+ -        top_vae_checkpoint = torch.load(self.opt['top_vae_path'])
+ +        top_vae_checkpoint = torch.load(self.opt['top_vae_path'], map_location=self.device)
+
+           self.decoder.load_state_dict(
+               top_vae_checkpoint['decoder'], strict=True)
+ @@ -137,7 +137,7 @@ class BaseSampleModel():
+           self.top_post_quant_conv.eval()
+
+       def load_bot_pretrain_network(self):
+ -        checkpoint = torch.load(self.opt['bot_vae_path'])
+ +        checkpoint = torch.load(self.opt['bot_vae_path'], map_location=self.device)
+           self.bot_decoder_res.load_state_dict(
+               checkpoint['bot_decoder_res'], strict=True)
+           self.decoder.load_state_dict(checkpoint['decoder'], strict=True)
+ @@ -153,7 +153,7 @@ class BaseSampleModel():
+
+       def load_pretrained_segm_token(self):
+           # load pretrained vqgan for segmentation mask
+ -        segm_token_checkpoint = torch.load(self.opt['segm_token_path'])
+ +        segm_token_checkpoint = torch.load(self.opt['segm_token_path'], map_location=self.device)
+           self.segm_encoder.load_state_dict(
+               segm_token_checkpoint['encoder'], strict=True)
+           self.segm_quantizer.load_state_dict(
+ @@ -166,7 +166,7 @@ class BaseSampleModel():
+           self.segm_quant_conv.eval()
+
+       def load_index_pred_network(self):
+ -        checkpoint = torch.load(self.opt['pretrained_index_network'])
+ +        checkpoint = torch.load(self.opt['pretrained_index_network'], map_location=self.device)
+           self.index_pred_guidance_encoder.load_state_dict(
+               checkpoint['guidance_encoder'], strict=True)
+           self.index_pred_decoder.load_state_dict(
+ @@ -176,7 +176,7 @@ class BaseSampleModel():
+           self.index_pred_decoder.eval()
+
+       def load_sampler_pretrained_network(self):
+ -        checkpoint = torch.load(self.opt['pretrained_sampler'])
+ +        checkpoint = torch.load(self.opt['pretrained_sampler'], map_location=self.device)
+           self.sampler_fn.load_state_dict(checkpoint, strict=True)
+           self.sampler_fn.eval()
+
+ @@ -397,7 +397,7 @@ class SampleFromPoseModel(BaseSampleModel):
+                           [185, 210, 205], [130, 165, 180], [225, 141, 151]]
+
+       def load_shape_generation_models(self):
+ -        checkpoint = torch.load(self.opt['pretrained_parsing_gen'])
+ +        checkpoint = torch.load(self.opt['pretrained_parsing_gen'], map_location=self.device)
+
+           self.shape_attr_embedder.load_state_dict(
+               checkpoint['embedder'], strict=True)
+ diff --git a/models/transformer_model.py b/models/transformer_model.py
+ index 7db0f3e..4523d17 100644
+ --- a/models/transformer_model.py
+ +++ b/models/transformer_model.py
+ @@ -21,7 +21,7 @@ class TransformerTextureAwareModel():
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.is_train = opt['is_train']
+
+           # VQVAE for image
+ @@ -317,10 +317,10 @@ class TransformerTextureAwareModel():
+       def sample_fn(self, temp=1.0, sample_steps=None):
+           self._denoise_fn.eval()
+
+ -        b, device = self.image.size(0), 'cuda'
+ +        b = self.image.size(0)
+           x_t = torch.ones(
+ -            (b, np.prod(self.shape)), device=device).long() * self.mask_id
+ -        unmasked = torch.zeros_like(x_t, device=device).bool()
+ +            (b, np.prod(self.shape)), device=self.device).long() * self.mask_id
+ +        unmasked = torch.zeros_like(x_t, device=self.device).bool()
+           sample_steps = list(range(1, sample_steps + 1))
+
+           texture_mask_flatten = self.texture_tokens.view(-1)
+ @@ -336,11 +336,11 @@ class TransformerTextureAwareModel():
+
+           for t in reversed(sample_steps):
+               print(f'Sample timestep {t:4d}', end='\r')
+ -            t = torch.full((b, ), t, device=device, dtype=torch.long)
+ +            t = torch.full((b, ), t, device=self.device, dtype=torch.long)
+
+               # where to unmask
+               changes = torch.rand(
+ -                x_t.shape, device=device) < 1 / t.float().unsqueeze(-1)
+ +                x_t.shape, device=self.device) < 1 / t.float().unsqueeze(-1)
+               # don't unmask somewhere already unmasked
+               changes = torch.bitwise_xor(changes,
+                                           torch.bitwise_and(changes, unmasked))
+ diff --git a/models/vqgan_model.py b/models/vqgan_model.py
+ index 13a2e70..9c840f1 100644
+ --- a/models/vqgan_model.py
+ +++ b/models/vqgan_model.py
+ @@ -20,7 +20,7 @@ class VQModel():
+       def __init__(self, opt):
+           super().__init__()
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.encoder = Encoder(
+               ch=opt['ch'],
+               num_res_blocks=opt['num_res_blocks'],
+ @@ -390,7 +390,7 @@ class VQImageSegmTextureModel(VQImageModel):
+
+       def __init__(self, opt):
+           self.opt = opt
+ -        self.device = torch.device('cuda')
+ +        self.device = torch.device(opt['device'])
+           self.encoder = Encoder(
+               ch=opt['ch'],
+               num_res_blocks=opt['num_res_blocks'],
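Note on the `sample_fn` hunk in the patch: the context lines show that at timestep `t`, counting down from `sample_steps` to 1, each position is proposed for unmasking with probability `1 / t`, and positions that are already unmasked are excluded. The following pure-Python sketch simulates only that schedule. `unmask_schedule` is a hypothetical name, and folding the newly revealed positions back into `unmasked` with a logical OR is an assumption, since that line is not visible in the patch; the real code works on `torch` tensors and also predicts token values, which is omitted here.

```python
import random


def unmask_schedule(num_tokens, sample_steps, seed=0):
    """Simulate which positions get revealed at each timestep.

    Returns the final unmasked flags and the number of positions newly
    revealed at each step (ordered from t = sample_steps down to t = 1).
    """
    rng = random.Random(seed)
    unmasked = [False] * num_tokens
    revealed_per_step = []
    for t in range(sample_steps, 0, -1):
        # Where to unmask: each position is a candidate with probability 1/t.
        changes = [rng.random() < 1 / t for _ in range(num_tokens)]
        # Don't unmask somewhere already unmasked; this is equivalent to
        # xor(changes, and(changes, unmasked)) in the patched code.
        changes = [c and not u for c, u in zip(changes, unmasked)]
        # Assumed update step: remember everything revealed so far.
        unmasked = [u or c for u, c in zip(unmasked, changes)]
        revealed_per_step.append(sum(changes))
    return unmasked, revealed_per_step


final_flags, revealed = unmask_schedule(num_tokens=32, sample_steps=10)
```

Because the probability reaches 1 at `t = 1`, every position is revealed by the final step, whatever the seed. Fewer sample steps mean more positions are revealed per step.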
pose_images/000.png ADDED
pose_images/001.png ADDED
pose_images/002.png ADDED
pose_images/003.png ADDED
pose_images/004.png ADDED
pose_images/005.png ADDED
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ einops==0.4.1
+ lpips==0.1.4
+ mmcv-full==1.5.2
+ mmsegmentation==0.24.1
+ numpy==1.22.3
+ openmim==0.1.5
+ Pillow==9.1.1
+ sentence-transformers==2.2.0
+ tokenizers==0.12.1
+ torch==1.11.0
+ torchvision==0.12.0
+ transformers==4.19.2
style.css ADDED
@@ -0,0 +1,16 @@
+ h1 {
+   text-align: center;
+ }
+ #input-image {
+   max-height: 300px;
+ }
+ #label-image {
+   height: 300px;
+ }
+ #result-image {
+   height: 300px;
+ }
+ img#visitor-badge {
+   display: block;
+   margin: auto;
+ }