---
title: Unsupervised Generative Video Dubbing
emoji: 🎥
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: enjoy
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)

Training code for the dubbing model is in the root directory. We used a pre-processed LRW dataset for training; see `data.py` for details.

We created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline uses the model weights we pre-trained on LRW: it takes a video and an audio segment of equal duration and outputs a dubbed video driven by the audio. See the instructions below for more details.

## Requirements

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0

## Post-Processing Folder

```
.
├── source
│   ├── audio_driver_mp4       # contains audio drivers (saved in mp4 format)
│   ├── audio_driver_wav       # contains audio drivers (saved in wav format)
│   ├── base_video             # contains base videos (videos you'd like to modify)
│   ├── dlib                   # trained dlib models
│   └── model                  # trained landmark generation models
├── main.py                    # main function for post-processing
├── main_support.py            # support functions used in main.py
├── models.py                  # defines the landmark generation model
├── step_3_vid2vid.sh          # Bash script for running vid2vid
├── step_4_denoise.sh          # Bash script for denoising vid2vid results
├── compare_openness.ipynb     # mouth openness comparison across generated videos
└── README.md
```

> - shape_predictor_68_face_landmarks.dat
>
> This is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).
>
> The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-the-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}
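For reference, the sketch below shows one way the 68-point predictor above can be loaded with dlib to read facial landmarks from a single frame of a base video (as done conceptually in Step 1). It is only an illustrative sketch: the file names are assumptions, and the pipeline's actual landmark extraction lives in `main_support.py`.

```python
# Hedged sketch: extract 68-point facial landmarks from the first frame of a base video.
# Paths are assumptions; the pipeline's own extraction is implemented in main_support.py.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")

video = cv2.VideoCapture("./source/base_video/base_video.mp4")
ok, frame = video.read()
video.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)  # upsample once to help detect smaller faces
    if faces:
        shape = predictor(gray, faces[0])
        landmarks = np.array([[p.x, p.y] for p in shape.parts()])  # (68, 2) array
        print(landmarks.shape)
```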
## Detailed steps for model deployment

- **Go to** the `post_processing` directory
- **Run** `python3 main.py -r step`, where `step` is the number of the corresponding step below
  - e.g. `python3 main.py -r 1` runs the first step, and so on

#### Step 1 — Generate landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy: generated landmarks in `npy` format
  - source.txt: information about the base video, audio driver, and model epoch
- Process
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to retrieve mouth landmarks
  - Combine the facial & mouth landmarks and save them in `npy` format

#### Step 2 — Test generated frames

- Input
  - None
- Output (`./result`)
  - Folder — save_keypoints: visualized generated frames
  - Folder — save_keypoints_csv: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process
  - Generate images from the `npy` file
  - Generate the openness plot

#### Step 3 — Execute vid2vid

- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; please copy them back to `./result/vid2vid_frames/`
  - Folder: vid2vid-generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder

#### Step 4 — Denoise and smooth vid2vid results

- Input
  - vid2vid-generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process
  - Crop the mouth areas from the vid2vid-generated images and paste them back onto the base images to produce the modified images
  - Generate circularly smoothed images using gradient masking
  - Take `(modified image, circularly smoothed image)` pairs and denoise them

#### Step 5 — Generate modified videos with sound

- Input
  - Saved frames folder path
    - By default this is `./result/save_keypoints`; enter `d` to use the default path
    - Otherwise, input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: modified video without sound
  - audio_only.mp4: audio driver
  - final_output.mp4: modified video with sound
- Process
  - Generate the modified video (without sound) at the defined fps
  - Extract the `wav` track from the audio driver
  - Combine the audio and video to generate the final output

## Important Notice

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function (a hedged librosa sketch follows this list)
  - Be careful about the sample rate, window length, and hop length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step_3 of `main.py`; it should match your base video's fps

  ```python
  # How to check your base video fps
  # source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/
  import cv2

  video = cv2.VideoCapture("video.mp4")

  # Find the OpenCV version
  (major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

  if int(major_ver) < 3:
      fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
      print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
  else:
      fps = video.get(cv2.CAP_PROP_FPS)
      print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

  video.release()
  ```

- You may need to modify the shell path

  ```shell
  echo $SHELL
  ```

- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function (an ffmpeg sketch follows this list)
  - Good resources: https://ffmpeg.org/ffmpeg.html and https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50
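As noted above for `extract_mfcc`, the right MFCC parameters depend on your audio. The sketch below only illustrates where the sample rate, window length, and hop length enter a librosa-based extraction; the specific values are assumptions, not the pipeline's defaults.

```python
# Hedged sketch of MFCC extraction with librosa; sr, n_fft, win_length, and hop_length
# below are illustrative assumptions, not the values used by extract_mfcc.
import librosa

y, sr = librosa.load("./source/audio_driver_wav/audio_driver.wav", sr=16000)  # resample to 16 kHz

mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,          # number of coefficients per frame
    n_fft=400,          # 25 ms analysis window at 16 kHz
    win_length=400,
    hop_length=160,     # 10 ms hop at 16 kHz
)
print(mfcc.shape)       # (n_mfcc, n_frames)
```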
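Likewise, `combine_audio_video` ultimately comes down to an ffmpeg invocation whose flags depend on your codecs and fps. Below is a hedged sketch of one way to mux the step 5 outputs; the paths, codec choices, and the subprocess call are assumptions, not the pipeline's exact code.

```python
# Hedged sketch: mux the silent video from step 5 with the driver audio using ffmpeg.
# Paths and codec choices are assumptions, not the pipeline's exact parameters.
import subprocess

video_in = "./result/save_keypoints/result/video_without_sound.mp4"
audio_in = "./source/audio_driver_wav/audio_driver.wav"
output = "./result/save_keypoints/result/final_output.mp4"

subprocess.run([
    "ffmpeg", "-y",
    "-i", video_in,      # silent video generated in step 5
    "-i", audio_in,      # driver audio
    "-c:v", "copy",      # keep the video stream as-is
    "-c:a", "aac",       # encode audio to AAC for the mp4 container
    "-shortest",         # stop at the end of the shorter stream
    output,
], check=True)
```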
## Update History

- March 22, 2020: Drafted documentation