---
title: Unsupervised Generative Video Dubbing
emoji: 🎥
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: enjoy
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)

Training code for the dubbing model is in the root directory. We used a pre-processed LRW dataset for training; see `data.py` for details.

We created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline uses the model weights we pre-trained on LRW: it takes a video and an audio segment of equal duration and outputs a dubbed video driven by the audio. See the instructions below for more details.

## Requirements

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0

## Post-Processing Folder

```
.
├── source
│   ├── audio_driver_mp4       # contains audio drivers (saved in mp4 format)
│   ├── audio_driver_wav       # contains audio drivers (saved in wav format)
│   ├── base_video             # contains base videos (videos you'd like to modify)
│   ├── dlib                   # trained dlib models
│   └── model                  # trained landmark generation models
├── main.py                    # main function for post-processing
├── main_support.py            # support functions used in main.py
├── models.py                  # defines the landmark generation model
├── step_3_vid2vid.sh          # Bash script for running vid2vid
├── step_4_denoise.sh          # Bash script for denoising vid2vid results
├── compare_openness.ipynb     # mouth openness comparison across generated videos
└── README.md
```

> - shape_predictor_68_face_landmarks.dat
>
> This is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).
>
> The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-the-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}
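For reference, the sketch below shows one way the 68-point predictor above can be loaded with dlib to read facial landmarks from a single frame of a base video (as done conceptually in Step 1). It is only an illustrative sketch: the file names are assumptions, and the pipeline's actual landmark extraction lives in `main_support.py`.

```python
# Hedged sketch: extract 68-point facial landmarks from the first frame of a base video.
# Paths are assumptions; the pipeline's own extraction is implemented in main_support.py.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")

video = cv2.VideoCapture("./source/base_video/base_video.mp4")
ok, frame = video.read()
video.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)  # upsample once to help detect smaller faces
    if faces:
        shape = predictor(gray, faces[0])
        landmarks = np.array([[p.x, p.y] for p in shape.parts()])  # (68, 2) array
        print(landmarks.shape)
```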
## Detailed steps for model deployment

- **Go to** the `post_processing` directory
- **Run** `python3 main.py -r step`, where `step` is the number of the corresponding step below
  - e.g. `python3 main.py -r 1` runs the first step, and so on

#### Step 1 — Generate landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy: generated landmarks in `npy` format
  - source.txt: information about the base video, audio driver, and model epoch
- Process
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to retrieve mouth landmarks
  - Combine the facial & mouth landmarks and save them in `npy` format

#### Step 2 — Test generated frames

- Input
  - None
- Output (`./result`)
  - Folder — save_keypoints: visualized generated frames
  - Folder — save_keypoints_csv: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process
  - Generate images from the `npy` file
  - Generate the openness plot

#### Step 3 — Execute vid2vid

- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; please copy them back to `./result/vid2vid_frames/`
  - Folder: vid2vid-generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder

#### Step 4 — Denoise and smooth vid2vid results

- Input
  - vid2vid-generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process
  - Crop the mouth areas from the vid2vid-generated images and paste them back onto the base images to produce the modified images
  - Generate circularly smoothed images using gradient masking
  - Take `(modified image, circularly smoothed image)` pairs and denoise them

#### Step 5 — Generate modified videos with sound

- Input
  - Saved frames folder path
    - By default this is `./result/save_keypoints`; enter `d` to use the default path
    - Otherwise, input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: modified video without sound
  - audio_only.mp4: audio driver
  - final_output.mp4: modified video with sound
- Process
  - Generate the modified video (without sound) at the defined fps
  - Extract the `wav` track from the audio driver
  - Combine the audio and video to generate the final output

## Important Notice

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function (a hedged librosa sketch follows this list)
  - Be careful about the sample rate, window length, and hop length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step_3 of `main.py`; it should match your base video's fps

  ```python
  # How to check your base video fps
  # source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/
  import cv2

  video = cv2.VideoCapture("video.mp4")

  # Find the OpenCV version
  (major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

  if int(major_ver) < 3:
      fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
      print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
  else:
      fps = video.get(cv2.CAP_PROP_FPS)
      print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

  video.release()
  ```

- You may need to modify the shell path

  ```shell
  echo $SHELL
  ```

- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function (an ffmpeg sketch follows this list)
  - Good resources: https://ffmpeg.org/ffmpeg.html and https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50
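As noted above for `extract_mfcc`, the right MFCC parameters depend on your audio. The sketch below only illustrates where the sample rate, window length, and hop length enter a librosa-based extraction; the specific values are assumptions, not the pipeline's defaults.

```python
# Hedged sketch of MFCC extraction with librosa; sr, n_fft, win_length, and hop_length
# below are illustrative assumptions, not the values used by extract_mfcc.
import librosa

y, sr = librosa.load("./source/audio_driver_wav/audio_driver.wav", sr=16000)  # resample to 16 kHz

mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,          # number of coefficients per frame
    n_fft=400,          # 25 ms analysis window at 16 kHz
    win_length=400,
    hop_length=160,     # 10 ms hop at 16 kHz
)
print(mfcc.shape)       # (n_mfcc, n_frames)
```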
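Likewise, `combine_audio_video` ultimately comes down to an ffmpeg invocation whose flags depend on your codecs and fps. Below is a hedged sketch of one way to mux the step 5 outputs; the paths, codec choices, and the subprocess call are assumptions, not the pipeline's exact code.

```python
# Hedged sketch: mux the silent video from step 5 with the driver audio using ffmpeg.
# Paths and codec choices are assumptions, not the pipeline's exact parameters.
import subprocess

video_in = "./result/save_keypoints/result/video_without_sound.mp4"
audio_in = "./source/audio_driver_wav/audio_driver.wav"
output = "./result/save_keypoints/result/final_output.mp4"

subprocess.run([
    "ffmpeg", "-y",
    "-i", video_in,      # silent video generated in step 5
    "-i", audio_in,      # driver audio
    "-c:v", "copy",      # keep the video stream as-is
    "-c:a", "aac",       # encode audio to AAC for the mp4 container
    "-shortest",         # stop at the end of the shorter stream
    output,
], check=True)
```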
## Update History

- March 22, 2020: Drafted documentation