VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Authors
Liu, Li [1]
Wang, Jinhui [1]
Chen, Shijuan [1]
Li, Zongmei [1]
Affiliations
[1] Xiamen Univ Technol, Dept Comp & Informat Engn, Xiamen 361204, Peoples R China
Keywords
facial animation generation; speech-driven lip synchronization; cross-attention mechanism; feature fusion;
DOI
10.3390/electronics13183657
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model's robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.
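The abstract names a cross-attention mechanism for audio-visual feature fusion but gives no formulation. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the choice of visual features as queries and audio features as keys and values, the omission of learned projection matrices, the residual addition, and the names `cross_attention`, `visual`, and `audio` are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cross_attention(visual, audio):
    """Fuse audio information into visual features (hypothetical setup).

    visual: Nv x d matrix used as queries (assumption).
    audio:  Na x d matrix used as both keys and values (assumption).
    Learned query/key/value projections are omitted for brevity.
    Returns (fused, weights): fused is Nv x d, weights is Nv x Na.
    """
    d = len(visual[0])
    scores = matmul(visual, transpose(audio))
    # Scale by sqrt(d) before the softmax, as in standard dot-product attention.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, audio)
    # Residual connection: add the attended audio back onto the visual features.
    fused = [
        [v + a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(visual, attended)
    ]
    return fused, weights


# Toy inputs: 2 visual positions and 3 audio frames, feature size 2.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(visual, audio)
```

Each row of `weights` is a probability distribution over the audio frames, so every visual position receives a weighted mix of audio features. A real model would apply learned projections and usually several attention heads.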
Pages: 19