VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Cited by: 0
Authors
Liu, Li [1 ]
Wang, Jinhui [1 ]
Chen, Shijuan [1 ]
Li, Zongmei [1 ]
Affiliation
[1] Xiamen Univ Technol, Dept Comp & Informat Engn, Xiamen 361204, Peoples R China
Keywords
facial animation generation; speech-driven lip synchronization; cross-attention mechanism; feature fusion;
DOI
10.3390/electronics13183657
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model's robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.
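The abstract names Squeeze-and-Excitation (SE) residual blocks as one of the model's optimizations. As a rough illustration of the SE idea only (the paper's exact placement of these blocks inside the Wav2Lip generator is not given here), the sketch below implements the standard squeeze/excitation/scale steps on a single feature map in NumPy; the weight matrices `w1` and `w2` and the reduction ratio are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation channel reweighting for a (C, H, W) feature map.

    Illustrative sketch only; w1 (C//r, C) and w2 (C, C//r) stand in for
    the learned excitation weights of an SE residual block.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid -> per-channel gate in (0, 1)
    s = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Scale: reweight each channel of the input by its gate
    return feature_map * s[:, None, None]
```

Because the gates lie in (0, 1), the block attenuates less informative channels while preserving the sign and shape of the input; in a residual variant, this reweighted output would be added back to the block's input.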
Pages: 19