---
title: Unsupervised Generative Video Dubbing
emoji: 🔥
colorFrom: blue
colorTo: blue
sdk: gradio
short_description: enjoy
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)

Training code for the dubbing model is under the root directory. We used a pre-processed version of LRW for training; see `data.py` for details.

We also provide a simple deployment pipeline under the `post_processing` subdirectory. It loads the model weights we pre-trained on LRW, takes a base video plus an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.

## Requirements

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0

## Post-Processing Folder

```
.
├── source
│   ├── audio_driver_mp4     # audio drivers (saved in mp4 format)
│   ├── audio_driver_wav     # audio drivers (saved in wav format)
│   ├── base_video           # base videos (the videos you'd like to modify)
│   ├── dlib                 # trained dlib models
│   └── model                # trained landmark generation models
├── main.py                  # main function for post-processing
├── main_support.py          # support functions used in main.py
├── models.py                # defines the landmark generation model
├── step_3_vid2vid.sh        # Bash script for running vid2vid
├── step_4_denoise.sh        # Bash script for denoising vid2vid results
├── compare_openness.ipynb   # mouth openness comparison across generated videos
└── README.md
```

> - shape_predictor_68_face_landmarks.dat
>
> This model is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).
>
> The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-the-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}

## Detailed steps for model deployment

- **Go to** the `post_processing` directory
- Run `python3 main.py -r <step>`, where `<step>` is the number of the step described below
  - e.g., `python3 main.py -r 1` runs the first step, and so on

#### Step 1: Generate landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy: generated landmarks in `npy` format
  - source.txt: information about the base video, audio driver, and model epoch
- Process (a minimal sketch follows this list)
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to obtain mouth landmarks
  - Combine the facial and mouth landmarks and save them in `npy` format

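For orientation, here is a hedged sketch of what Step 1 does, assuming dlib's 68-point predictor and librosa for MFCCs; the helper names and the model interface are illustrative, not the actual code in `main.py`:

```python
# Illustrative sketch of Step 1 (not the actual implementation in main.py).
# Assumes dlib's 68-point landmark predictor and librosa for MFCC features;
# the model call at the bottom is a placeholder for the network defined in models.py.
import cv2
import dlib
import librosa
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")

def video_landmarks(video_path):
    """Return a (num_frames, 68, 2) array of facial landmarks from the base video."""
    cap = cv2.VideoCapture(video_path)
    all_landmarks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            continue  # the real pipeline needs a policy for missed detections
        shape = predictor(gray, faces[0])
        all_landmarks.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
    cap.release()
    return np.array(all_landmarks)

def audio_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return MFCC features; parameters must match those used during training."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (num_windows, n_mfcc)

landmarks = video_landmarks("./source/base_video/base_video.mp4")
mfcc = audio_mfcc("./source/audio_driver_wav/audio_driver.wav")
# mouth = model(mfcc, landmarks)                 # model loading/inference omitted
# keypoints = combine(landmarks, mouth)          # merge face and generated mouth points
# np.save("./result/keypoints.npy", keypoints)
```
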
#### Step 2: Test generated frames

- Input
  - None
- Output (`./result`)
  - Folder `save_keypoints`: visualized generated frames
  - Folder `save_keypoints_csv`: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process (a sketch of the openness measure follows this list)
  - Generate images from the `npy` file
  - Generate the openness plot

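The openness plot is essentially a per-frame distance between the upper and lower lip landmarks. A minimal sketch, assuming the 68-point indexing where 62 and 66 are the inner-lip midpoints; the exact measure used by `compare_openness.ipynb` may differ:

```python
# Illustrative mouth-openness measure (the exact definition lives in compare_openness.ipynb).
# Assumes keypoints.npy holds (num_frames, 68, 2) landmarks and that indices 62/66 are the
# upper/lower inner-lip midpoints of the 68-point scheme.
import numpy as np
import matplotlib.pyplot as plt

keypoints = np.load("./result/keypoints.npy")

# Per-frame distance between the upper and lower inner-lip points.
openness = np.linalg.norm(keypoints[:, 62, :] - keypoints[:, 66, :], axis=1)

plt.plot(openness)
plt.xlabel("frame")
plt.ylabel("mouth openness (pixels)")
plt.savefig("./result/openness.png")
```
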
#### Step 3: Execute vid2vid

- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; please copy them back to `/result/vid2vid_frames/`
  - Folder: vid2vid generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder

#### Step 4: Denoise and smooth vid2vid results

- Input
  - vid2vid generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process (a sketch of the masking step follows this list)
  - Crop the mouth area from each vid2vid generated image and paste it back onto the base image, giving the modified image
  - Generate circularly smoothed images using gradient masking
  - Take `(modified image, circularly smoothed image)` pairs and denoise them

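To make the masking step concrete, here is a hedged sketch of blending a vid2vid mouth crop into the base frame with a radial alpha mask; the mouth box, mask shape, and file names are illustrative, and the real coordinates come from `frame_crop`:

```python
# Hypothetical sketch of the gradient-masking step: blend the vid2vid mouth region into the
# base frame with a radial alpha mask so the seam fades out smoothly. The mouth-box
# coordinates, mask falloff, and file names below are illustrative.
import cv2
import numpy as np

def radial_mask(h, w, falloff=0.25):
    """Alpha mask that is 1.0 in the center and fades to 0.0 toward the border."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dist = np.sqrt(((ys - cy) / (h / 2.0)) ** 2 + ((xs - cx) / (w / 2.0)) ** 2)
    return np.clip((1.0 - dist) / falloff, 0.0, 1.0)

def paste_mouth(base_img, vid2vid_img, box):
    """Blend the mouth region of vid2vid_img into base_img inside box = (x, y, w, h)."""
    x, y, w, h = box
    alpha = radial_mask(h, w)[..., None]              # (h, w, 1), broadcast over channels
    out = base_img.copy().astype(np.float32)
    crop = vid2vid_img[y:y + h, x:x + w].astype(np.float32)
    out[y:y + h, x:x + w] = alpha * crop + (1.0 - alpha) * out[y:y + h, x:x + w]
    return out.astype(np.uint8)

base = cv2.imread("base_frame_0001.png")
fake = cv2.imread("vid2vid_frame_0001.png")           # assumed aligned with the base frame
blended = paste_mouth(base, fake, box=(96, 160, 64, 48))  # example mouth box
cv2.imwrite("modified_frame_0001.png", blended)
```
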
#### Step 5: Generate the modified video with sound

- Input
  - Saved frames folder path
    - By default this is `./result/save_keypoints`; you can enter `d` to use the default path
    - Otherwise, input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: the modified video without sound
  - audio_only.mp4: the audio driver
  - final_output.mp4: the modified video with sound
- Process (a sketch of the muxing step follows this list)
  - Generate the modified video without sound at the defined fps
  - Extract the `wav` track from the audio driver
  - Combine the audio and video to generate the final output

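The final step amounts to two ffmpeg invocations. A minimal sketch, assuming ffmpeg is on the PATH and the frames are written as `frame_%05d.png`; the frame pattern, fps, and codecs here are placeholders for whatever `combine_audio_video` actually uses:

```python
# Minimal sketch of the final muxing step. Assumes ffmpeg is available on PATH and frames
# are named frame_00001.png, frame_00002.png, ...; adjust the pattern, fps, and codecs to
# match your data (see combine_audio_video for the actual commands).
import subprocess

fps = 25  # must match the base video's frame rate (see the fps snippet below)

# 1) Frames -> silent video.
subprocess.run([
    "ffmpeg", "-y", "-framerate", str(fps),
    "-i", "./result/save_keypoints/frame_%05d.png",
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "video_without_sound.mp4",
], check=True)

# 2) Silent video + audio driver -> final output.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "video_without_sound.mp4",
    "-i", "./source/audio_driver_wav/audio_driver.wav",
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "final_output.mp4",
], check=True)
```
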
## Important Notice

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function (a hedged example follows)
  - Be careful about the sample rate, window length, and hop length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html

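For reference, a minimal librosa-based sketch of how such features are typically computed; the sample rate, window length, and hop length below are placeholders and must match whatever `extract_mfcc` used during training:

```python
# Illustrative only: MFCC extraction with librosa 0.7.x. The sample rate, n_fft (window
# length) and hop_length are placeholders; they must match the values used at training time.
import librosa

y, sr = librosa.load("./source/audio_driver_wav/audio_driver.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,        # number of coefficients
    n_fft=400,        # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms hop at 16 kHz
)
print(mfcc.shape)     # (n_mfcc, num_windows)
```
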
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step 3 of `main.py`; it should match your base video's fps

```python
# How to check your base video fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/

import cv2

video = cv2.VideoCapture("video.mp4")

# Find OpenCV version
(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

if int(major_ver) < 3:
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

video.release()
```

- You may need to modify the shell path; you can check your current shell with:

```shell
echo $SHELL
```

- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function
  - Good resource: https://ffmpeg.org/ffmpeg.html
  - https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50

## Update History

- March 22, 2020: Drafted documentation