VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Cited by: 0

Authors:

Liu, Li [1]

Wang, Jinhui [1]

Chen, Shijuan [1]

Li, Zongmei [1]

Affiliations:

[1] Xiamen Univ Technol, Dept Comp & Informat Engn, Xiamen 361204, Peoples R China

Source:

ELECTRONICS | 2024 / Vol. 13 / Issue 18

Keywords:

facial animation generation; speech-driven lip synchronization; cross-attention mechanism; feature fusion

DOI:

10.3390/electronics13183657

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model's robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.


Pages: 19
