Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
Abstract
The recent state of the art on monocular 3D face reconstruction from image data has made some impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest do not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interactions, interviews, monologues/dialogues in movies, etc.). When existing 3D face reconstruction methods are applied to such videos, the artifacts in the reconstruction of the shape and motion of the mouth area are often severe, since they do not match well with the speech audio. To overcome the aforementioned limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the elicited perception from the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements compared to traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training on unlabeled datasets. We verify the efficiency of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.
Community
Proposes Spectre: a monocular RGB-to-3D face mesh reconstruction pipeline, trained on videos of verbal communication (exploiting their dynamic nature and temporal consistency) with a lip-reading (speech-perceptual/viseme) loss that needs no text transcription of the audio; the lip reader captures the 3D mouth expressions. Builds on DECA, which jointly regresses FLAME parameters with a CNN and combines several weighted losses to compensate for the lack of 3D data. DECA's coarse encoder (ResNet50) predicts identity (100-d), neck pose and jaw (6-d), expression (50-d), albedo (50-d), lighting (27-d), and camera (3-d) parameters; EMOCA added an expression branch. Spectre extracts a 3D mesh in FLAME topology for each video frame. The rigid and identity parameters (identity, neck pose, albedo, lighting, and camera) come from DECA's fixed coarse (identity/scene) encoder, while the expression and jaw parameters are predicted by a perceptual CNN encoder (MobileNet v2 with a temporal convolution kernel). The FLAME parameters from both encoders are passed to a differentiable renderer. An emotion recognition network (from EMOCA) takes the input video and the rendered image sequence and gives a perceptual expression loss (minimize the distance between the two sequences of feature vectors), which helps with generic facial characteristics. The perceptual lip-movement loss uses a pretrained lip-reading network (trained on Lip Reading Sentences 3, LRS3, to predict character sequences from mouth-cropped image sequences, with a connectionist temporal classification (CTC) loss plus attention; the architecture is a 3D convolution, a ResNet, a Conformer, then a transformer decoder): mouth crops of the differentiably rendered images and of the input images are fed through it, feature vectors are taken at the ResNet output, and the cosine distance between them (averaged over the sequence) is minimized. Expression and jaw are further regularized by their L2 distance from DECA's predictions. The geometric terms are an L1 loss on the landmarks of the nose, face outline, and eyes (face alignment landmarks) and an L2 relative loss on the intra-distances of the mouth landmarks.
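The lip-movement and relative mouth-landmark terms described above can be sketched in a few lines. This is only an illustration in plain Python, not the authors' implementation: the function names, the use of a mean over landmark pairs, and the choice of which pairs to compare are assumptions, and a real training loop would use a tensor library with autograd on features taken from the lip reader.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def lipread_loss(feats_input, feats_rendered):
    """Cosine distance between per-frame features, averaged over the sequence.

    feats_input / feats_rendered: one feature vector per frame, e.g. taken
    from the lip reader for the input and the rendered mouth crops.
    """
    if len(feats_input) != len(feats_rendered):
        raise ValueError("sequences must have the same number of frames")
    distances = [cosine_distance(u, v) for u, v in zip(feats_input, feats_rendered)]
    return sum(distances) / len(distances)


def relative_mouth_loss(pred, target, pairs):
    """Squared difference of intra-mouth distances, averaged over pairs.

    pred / target: lists of landmark coordinates (tuples).
    pairs: hypothetical (i, j) index pairs, e.g. upper/lower lip points.
    Averaging over pairs is an assumption made for this sketch.
    """
    total = 0.0
    for i, j in pairs:
        d_pred = math.dist(pred[i], pred[j])
        d_target = math.dist(target[i], target[j])
        total += (d_pred - d_target) ** 2
    return total / len(pairs)
```

Because the relative term compares distances between mouth points rather than their absolute positions, it is unaffected by a global shift of the landmarks, which makes it less sensitive to misaligned landmark detections around the lips.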
The final loss is a linear combination of the lip-reading, expression, and geometric (landmark distance-based) losses. Trained on LRS3; tested on LRS3, MEAD (reading from TIMIT with expressions), and TCD-TIMIT; baselines are DECA, EMOCA, 3DDFAv2, and DAD-3DHeads (direct supervision). Lip-reading evaluation uses AV-HuBERT; Spectre gets the lowest character (CER), word (WER), viseme (VER, from the Amazon Polly phoneme-to-viseme mapping), and viseme-word (VWER) error rates. Realism is evaluated through human preferences: users pick the more realistic face, and Spectre is preferred over the baselines. In the lip-reading user study, users pick the word uttered by the 3D talking head (word-level lip reading by non-experts). Taking the lip-reading features from the ResNet works better than taking them from the Conformer (the latter does sequence modeling, so its features are further from the image level). The appendix/supplementary material shows that direct landmarks are inaccurate/noisy (especially around the lip region), and includes an ablation on the features (ResNet or Conformer) and on a CTC loss (text prediction by the lip reader), details of the loss function, failure cases (domain gap and failure of the lip reader), and more qualitative results. From NTUA (National Technical University of Athens), FORTH (Greece), and Imperial College London.
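As a rough illustration of the two quantities mentioned above, the sketch below computes a weighted sum of loss terms and the edit-distance-based error rates (CER and WER). The weights and loss values are hypothetical placeholders, not values from the paper, and the error-rate code is the standard Levenshtein formulation rather than the authors' evaluation script.

```python
def total_loss(losses, weights):
    """Linear combination of named loss terms with per-term weights."""
    return sum(weights[name] * value for name, value in losses.items())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are single characters."""
    return error_rate(list(reference), list(hypothesis))


# Hypothetical weights and loss values, for illustration only.
example = total_loss(
    {"lipread": 0.5, "expression": 0.2, "geometric": 0.1},
    {"lipread": 2.0, "expression": 1.0, "geometric": 10.0},
)
```

The viseme-level rates (VER, VWER) follow the same pattern once the transcripts are mapped to viseme sequences; that mapping is not reproduced here.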