Spatially and Temporally Optimized Audio-Driven Talking Face Generation

Times Cited: 0
Authors
Dong, Biao [1 ]
Ma, Bo-Yao [1 ]
Zhang, Lei [1 ]
Affiliations
[1] Beijing Institute of Technology, Beijing, People's Republic of China
Funding
National Key Research and Development Program of China;
Keywords
NETWORK;
DOI
10.1111/cgf.15228
CLC Number (Chinese Library Classification)
TP31 [Computer Software];
Subject Classification Codes
081202 ; 0835 ;
Abstract
Audio-driven talking face generation is essentially a cross-modal mapping from audio to video frames. The main challenge lies in the intricate one-to-many mapping, which degrades lip sync accuracy. In addition, the loss of facial details during image reconstruction often introduces visual artifacts into the generated video. To overcome these challenges, this paper proposes to enhance the quality of generated talking faces with a new spatio-temporal consistency. Specifically, temporal consistency is achieved through the consecutive frames of each phoneme, which form temporal modules that exhibit similar lip appearance changes. This allows the lip movement to be adaptively adjusted for accurate sync. Spatial consistency pertains to the uniform distribution of textures within local regions, which form spatial modules and regulate the texture distribution in the generator. This yields fine details in the reconstructed facial images. Extensive experiments show that our method generates more natural talking faces than previous state-of-the-art methods, with both more accurate lip sync and more realistic facial details.
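The abstract describes the two consistency constraints only at a high level. As a loose, illustrative sketch rather than the paper's actual method, the PyTorch-style code below shows one way such temporal and spatial consistency terms could be expressed as auxiliary training losses; the function names, tensor shapes, and the patch-statistics formulation are assumptions made here for illustration only.

import torch
import torch.nn.functional as F

def temporal_consistency_loss(lip_frames, phoneme_ids):
    # lip_frames: (T, C, H, W) cropped lip regions; phoneme_ids: (T,) integer labels.
    # Consecutive frames sharing a phoneme label play the role of a "temporal module".
    diffs = lip_frames[1:] - lip_frames[:-1]             # frame-to-frame appearance change
    same = phoneme_ids[1:] == phoneme_ids[:-1]           # pairs that stay inside one phoneme
    if same.sum() < 2:
        return lip_frames.new_zeros(())
    d = diffs[same].flatten(1)
    # Penalize the deviation of each within-phoneme change from the mean change,
    # a crude stand-in for "similar lip appearance changes" within a module.
    return ((d - d.mean(dim=0, keepdim=True)) ** 2).mean()

def spatial_consistency_loss(fake, real, patch=16):
    # fake, real: (B, C, H, W) generated and ground-truth face images.
    # Non-overlapping local patches play the role of "spatial modules"; matching their
    # per-patch mean/std loosely encodes "uniform texture distribution within local regions".
    def patch_stats(x):
        p = F.unfold(x, kernel_size=patch, stride=patch)  # (B, C*patch*patch, L)
        return p.mean(dim=1), p.std(dim=1)
    fm, fs = patch_stats(fake)
    rm, rs = patch_stats(real)
    return F.l1_loss(fm, rm) + F.l1_loss(fs, rs)

In such a setup, both terms would simply be added (with suitable weights) to the generator's usual reconstruction and lip-sync losses; the actual module design and weighting used in the paper are not specified in the abstract.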
Pages: 11
Related Papers
Total: 50 records
  • [31] Audio-Driven Stylized Gesture Generation with Flow-Based Model
    Ye, Sheng
    Wen, Yu-Hui
    Sun, Yanan
    He, Ying
    Zhang, Ziyang
    Wang, Yaoyuan
    He, Weihua
    Liu, Yong-Jin
    COMPUTER VISION - ECCV 2022, PT V, 2022, 13665 : 712 - 728
  • [32] Let's Play Music: Audio-driven Performance Video Generation
    Zhu, Hao
    Li, Yi
    Zhu, Feixia
    Zheng, Aihua
    He, Ran
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 3574 - 3581
  • [33] Photorealistic Audio-driven Video Portraits
    Wen, Xin
    Wang, Miao
    Richardt, Christian
    Chen, Ze-Yin
    Hu, Shi-Min
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2020, 26 (12) : 3457 - 3466
  • [34] Audio-Driven Emotional Video Portraits
    Ji, Xinya
    Zhou, Hang
    Wang, Kaisiyuan
    Wu, Wayne
    Loy, Chen Change
    Cao, Xun
    Xu, Feng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 14075 - 14084
  • [35] Audio-Driven Laughter Behavior Controller
    Ding, Yu
    Huang, Jing
    Pelachaud, Catherine
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2017, 8 (04) : 546 - 558
  • [36] Voice2Face: Audio-driven Facial and Tongue Rig Animations with cVAEs
    Aylagas, Monica Villanueva
    Leon, Hector Anadon
    Teye, Mattias
    Tollmar, Konrad
    COMPUTER GRAPHICS FORUM, 2022, 41 (08) : 255 - 265
  • [37] Talking Face Generation With Audio-Deduced Emotional Landmarks
    Zhai, Shuyan
    Liu, Meng
    Li, Yongqiang
    Gao, Zan
    Zhu, Lei
    Nie, Liqiang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (10) : 14099 - 14111
  • [38] Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
    Zhu, Lingting
    Liu, Xian
    Liu, Xuanyu
    Qian, Rui
    Liu, Ziwei
    Yu, Lequan
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10544 - 10553
  • [39] Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
    Liu, Xian
    Xu, Yinghao
    Wu, Qianyi
    Zhou, Hang
    Wu, Wayne
    Zhou, Bolei
    COMPUTER VISION, ECCV 2022, PT XXXVII, 2022, 13697 : 106 - 125
  • [40] Multimodal Learning for Temporally Coherent Talking Face Generation With Articulator Synergy
    Yu, Lingyun
    Xie, Hongtao
    Zhang, Yongdong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2950 - 2962