animatedaliensfans committed (verified)
Commit ab9d825
Parent(s): 0d86e22

Update README.md

Files changed (1):
  README.md +167 -1

README.md CHANGED

@@ -1,6 +1,6 @@
---
title: Unsupervised Generative Video Dubbing
- emoji: 🐨
+ emoji: πŸŽ₯
colorFrom: blue
colorTo: blue
sdk: gradio
@@ -12,3 +12,169 @@ short_description: enjoy
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)

Training code for the dubbing model is under the root directory. We used a pre-processed LRW dataset for training; see `data.py` for details.

We created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline takes the model weights we pre-trained on LRW, along with a video and an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.

## Requirements

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0

## Post-Processing Folder

```
.
β”œβ”€β”€ source
β”‚   β”œβ”€β”€ audio_driver_mp4    # audio drivers (saved in mp4 format)
β”‚   β”œβ”€β”€ audio_driver_wav    # audio drivers (saved in wav format)
β”‚   β”œβ”€β”€ base_video          # base videos (videos you'd like to modify)
β”‚   β”œβ”€β”€ dlib                # trained dlib models
β”‚   └── model               # trained landmark generation models
β”œβ”€β”€ main.py                 # main function for post-processing
β”œβ”€β”€ main_support.py         # support functions used in main.py
β”œβ”€β”€ models.py               # defines the landmark generation model
β”œβ”€β”€ step_3_vid2vid.sh       # Bash script for running vid2vid
β”œβ”€β”€ step_4_denoise.sh       # Bash script for denoising vid2vid results
β”œβ”€β”€ compare_openness.ipynb  # mouth openness comparison across generated videos
└── README.md
```

> - shape_predictor_68_face_landmarks.dat
>
> This is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).
>
> The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-the-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}

## Detailed steps for model deployment

- **Go to** the `post_processing` directory
- Run `python3 main.py -r <step>`, where `<step>` is the number of the corresponding step below
  - e.g. `python3 main.py -r 1` runs Step 1, and so on

#### Step 1 β€” Generate landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy: generated landmarks in `npy` format
  - source.txt: contains information about the base video, audio driver, and model epoch
- Process (a rough sketch of the feature extraction follows this list)
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to retrieve mouth landmarks
  - Combine facial & mouth landmarks and save them in `npy` format

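For reference, here is a minimal sketch of the two feature-extraction sub-steps, assuming the dlib predictor shipped in `./source/dlib` and default MFCC settings; the helper bodies below are illustrative, not the exact implementations in `main.py`/`main_support.py`:

```python
# Minimal sketch of Step 1 feature extraction, not the repo's exact code.
import dlib
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13, n_fft=400, hop_length=160):
    """Compute MFCC features for the driver audio (tune sr / window / hop to your data)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

def extract_landmarks(frame, detector, predictor):
    """Return the 68 dlib landmarks of the first detected face as a (68, 2) array."""
    faces = detector(frame, 1)
    shape = predictor(frame, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")
```
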
#### Step 2 β€” Test generated frames

- Input
  - None
- Output (`./result`)
  - Folder `save_keypoints`: visualized generated frames
  - Folder `save_keypoints_csv`: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process
  - Generate images from the `npy` file
  - Generate the openness plot (one way to measure openness is sketched below)

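A minimal sketch of how such an openness curve could be computed from the saved landmarks, assuming the standard 68-point indexing, the `keypoints.npy` layout described above, and matplotlib for plotting; `compare_openness.ipynb` may use a different metric:

```python
# Illustrative openness metric; the notebook's exact definition may differ.
import numpy as np
import matplotlib.pyplot as plt

def mouth_openness(landmarks):
    """Vertical gap between the inner-lip midpoints (points 62 and 66, 0-indexed)."""
    return np.linalg.norm(landmarks[62] - landmarks[66])

keypoints = np.load("./result/keypoints.npy")   # assumed shape: (n_frames, 68, 2)
openness = [mouth_openness(frame) for frame in keypoints]

plt.plot(openness)
plt.xlabel("frame")
plt.ylabel("mouth openness (px)")
plt.savefig("./result/openness.png")
```
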
#### Step 3 β€” Execute vid2vid

- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; please copy them back to `./result/vid2vid_frames/`
  - Folder: vid2vid-generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder (see the sketch below)

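For example, the copy-back can be as simple as the following, where the source path is a placeholder for the directory printed by `step_3_vid2vid.sh`:

```python
# Placeholder source path; the destination folder must not already exist.
import shutil

shutil.copytree("/path/printed/by/vid2vid", "./result/vid2vid_frames/")
```
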
#### Step 4 β€” Denoise and smooth vid2vid results

- Input
  - vid2vid-generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process (a rough sketch of the paste-back follows this list)
  - Crop the mouth area from each vid2vid-generated image and paste it back onto the base image to get the modified image
  - Generate circularly smoothed images using gradient masking
  - Take `(modified image, circularly smoothed image)` pairs and perform denoising

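As a rough illustration of the paste-back with a circular gradient mask; the crop box and feathering below are placeholders, and the real region of interest comes from the `frame_crop` function:

```python
# Illustrative blend only, not the exact implementation in main_support.py.
import cv2
import numpy as np

def paste_mouth(base_img, vid2vid_img, box, feather=15):
    """Blend the mouth region of a vid2vid frame into the base frame.

    box = (x, y, w, h) is the mouth region of interest in pixels.
    """
    x, y, w, h = box
    mask = np.zeros(base_img.shape[:2], dtype=np.float32)
    cv2.circle(mask, (x + w // 2, y + h // 2), max(w, h) // 2, 1.0, -1)
    mask = cv2.GaussianBlur(mask, (0, 0), feather)   # soft circular falloff
    mask = mask[..., None]                           # broadcast over BGR channels
    return (mask * vid2vid_img + (1.0 - mask) * base_img).astype(np.uint8)
```
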
#### Step 5 β€” Generate modified videos with sound

- Input
  - Saved frames folder path
    - By default, frames are saved in `./result/save_keypoints`; enter `d` to use this default path
    - Otherwise, input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: modified video without sound
  - audio_only.mp4: audio driver
  - final_output.mp4: modified video with sound
- Process (a sketch of the frames-to-video step follows this list)
  - Generate the modified video without sound at the defined fps
  - Extract the `wav` track from the audio driver
  - Combine audio and video to generate the final output

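A minimal sketch of the frames-to-video step, assuming PNG frames in the default folder; the frame naming and fps value are assumptions, and the fps must match your base video (see the notice below):

```python
# Assumed frame location and extension; check main_support.py for the real ones.
import glob
import cv2

frames = sorted(glob.glob("./result/save_keypoints/*.png"))
height, width = cv2.imread(frames[0]).shape[:2]

fps = 25.0  # must match the base video fps
writer = cv2.VideoWriter("video_without_sound.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()
```
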
## Important Notice

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function
  - Be careful about the sample rate, window length, and hop length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in `step_3` of `main.py`, which should be your base video's fps

```python
# How to check your base video fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/

import cv2

video = cv2.VideoCapture("video.mp4")

# Find the OpenCV version
(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

if int(major_ver) < 3:
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

video.release()
```

- You may need to modify the shell path

```shell
echo $SHELL
```

- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function (a sample ffmpeg invocation is sketched below)
  - Good resources: https://ffmpeg.org/ffmpeg.html and https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50

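For instance, a muxing step along these lines could be used; the file names and codec flags are placeholders to adjust to your setup and to whatever `combine_audio_video` actually does:

```python
# Hypothetical ffmpeg call for muxing the silent video with the driver audio.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "video_without_sound.mp4",   # video stream from Step 5
    "-i", "audio_driver.wav",          # driver audio
    "-c:v", "copy",                    # keep the video stream as-is
    "-c:a", "aac",                     # encode audio to AAC for mp4
    "-shortest",                       # stop at the shorter of the two streams
    "final_output.mp4",
], check=True)
```
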

## Update History

- March 22, 2020: Drafted documentation