Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation
Abstract
Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on VoxCeleb1 and CelebV datasets. Please check our Project page: https://github.com/harlanhong/ICCV2023-MCNET.
Community
Proposes MCNet, an implicit identity conditioned memory compensation network that applies the poses of a driving image to the identity from a source image. It first learns the motion flow between source and driving, then uses a memory bank conditioned on the source identity (implicit keypoint representations), for high-fidelity talking head video generation. A keypoint detector and a dense motion network predict keypoint locations on the source and driving images and the motion between them; the implicit identity conditioned memory module (IICM) contains the memory bank; the memory compensation module (MCM) uses dynamic cross-attention to output the compensated warped source feature map; the decoder uses the feature maps to generate results. The global facial meta-memory bank is initialized as a cube tensor (C, H, W). Implicit identity representation: take the projected feature channels (produced from the warped feature) and the source keypoints, apply spatial GAP (global average pooling) to the features and flatten the keypoints, concatenate the results, and pass them to an MLP. The memory is then conditioned through a manipulated convolution driven by the implicit identity representation (inspired by StyleGANv2). The memory compensation module (MCM) inpaints occluded parts of the face: the warped source features are split in two along the channel dimension; the first half passes through directly to preserve identity, and the second half goes through a 1x1 conv (channel projection). In the dynamic cross-attention, the query comes from this projection and the key and value come from the identity-dependent memory (all projected non-linearly); the result (compensated features) is concatenated with the first half. A regularisation loss enforces consistency between the value facet (from dynamic cross-attention) and the projected features (before the conv of the query facet), with no gradient through the query. Generation is done in a multi-layer (multi-scale) fashion; see Fig 2 for the overview, Fig 3 for IICM, and Fig 4 for MCM. Losses from FOMM (first order motion model, prior work): perceptual loss (between model output and driving image), plus equivariance and keypoint distance losses (for stable and uniform keypoints).
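The data flow summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all shapes, weight names, and function names are hypothetical, the projections are plain linear maps (the summary says they are non-linear), the identity vector is assumed to be computed from the same 1x1-projected features used for the query, and a per-channel scaling of the memory stands in for the StyleGANv2-inspired manipulated convolution.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def implicit_identity(proj_feat, src_kp, w1, b1, w2, b2):
    """proj_feat: (Cm, H, W), src_kp: (K, 2) -> identity vector (Cm,)."""
    pooled = proj_feat.mean(axis=(1, 2))              # spatial global average pooling
    x = np.concatenate([pooled, src_kp.reshape(-1)])  # pooled features + flattened keypoints
    hidden = np.maximum(x @ w1 + b1, 0.0)             # two-layer MLP with ReLU
    return hidden @ w2 + b2

def condition_memory(memory, identity):
    # ASSUMPTION: channel-wise scaling replaces the manipulated convolution.
    return memory * identity[:, None, None]

def compensate(warped, proj, memory, wq, wk, wv):
    """Cross-attention: query from projected features, key/value from memory."""
    c, h, w = warped.shape
    half = c // 2
    q = proj.reshape(proj.shape[0], -1).T @ wq        # (H*W, d)
    m = memory.reshape(memory.shape[0], -1).T         # (H*W, Cm)
    k, v = m @ wk, m @ wv                             # (H*W, d), (H*W, C - half)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))     # (H*W, H*W)
    comp = (attn @ v).T.reshape(c - half, h, w)       # compensated second half
    return np.concatenate([warped[:half], comp], axis=0)

# Hypothetical sizes: C feature channels, Cm memory channels, K keypoints.
rng = np.random.default_rng(0)
C, Cm, H, W, K, d = 8, 6, 4, 4, 10, 5
half = C // 2
warped = rng.normal(size=(C, H, W))                   # warped source features
memory = rng.normal(size=(Cm, H, W))                  # meta-memory bank (C, H, W cube)
src_kp = rng.uniform(-1, 1, size=(K, 2))
w_proj = rng.normal(size=(Cm, C - half))              # 1x1 conv = per-pixel channel map

proj = np.einsum('oc,chw->ohw', w_proj, warped[half:])
ident = implicit_identity(
    proj, src_kp,
    rng.normal(size=(Cm + 2 * K, 16)), np.zeros(16),
    rng.normal(size=(16, Cm)), np.zeros(Cm),
)
out = compensate(
    warped, proj, condition_memory(memory, ident),
    rng.normal(size=(Cm, d)), rng.normal(size=(Cm, d)),
    rng.normal(size=(Cm, C - half)),
)
```

The first half of `out` is the untouched first half of `warped`, which mirrors the identity-preserving pass-through; only the second half is replaced by memory-derived features.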
Compared with FOMM and DaGAN (on VoxCeleb1 and CelebV); the metrics used are SSIM, PSNR, LPIPS, AKD and AED (average keypoint distance and average Euclidean distance), plus AUCON and PRMSE for cross-identity generation (from DaGAN). Better than TPSN and MRAA on cross-identity generation (qualitative results in Fig 5); better preservation on same-identity generation. Ablations show that IICM and MCM improve over the baseline, FOMM, and TPSN on cross-identity data, and that channel splitting for MCM is necessary; memory bank representations are also shown. The appendix has implementation and architecture details, optimization loss formulations, details on evaluation metrics, more ablations, and qualitative visualizations. From HKUST.
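For readers unfamiliar with two of the metrics named above, here is a small NumPy sketch. PSNR follows its standard definition. The keypoint distance is an assumption: it averages the Euclidean distance between matched keypoints, whereas the paper's evaluation protocol relies on an external landmark detector and may normalize or measure distance differently. Function names are hypothetical.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)  # avoid uint8 wrap-around
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def average_keypoint_distance(kp_pred, kp_target):
    """Mean Euclidean distance between matched keypoints, arrays of shape (K, 2)."""
    return float(np.mean(np.linalg.norm(kp_pred - kp_target, axis=-1)))

black = np.zeros((2, 2), dtype=np.uint8)
white = np.full((2, 2), 255, dtype=np.uint8)
worst = psnr(white, black)   # MSE equals max_val**2, so the ratio is 1 and PSNR is 0 dB
akd = average_keypoint_distance(
    np.array([[0.0, 0.0], [3.0, 4.0]]),
    np.array([[0.0, 0.0], [0.0, 0.0]]),
)                            # distances are 0 and 5, so the mean is 2.5
```

Higher PSNR is better, while lower keypoint distance is better.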
Links: website (Prior work: DaGAN), PapersWithCode, GitHub