VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Cited by: 0
Authors
Liu, Li [1 ]
Wang, Jinhui [1 ]
Chen, Shijuan [1 ]
Li, Zongmei [1 ]
Affiliation
[1] Xiamen Univ Technol, Dept Comp & Informat Engn, Xiamen 361204, Peoples R China
Keywords
facial animation generation; speech-driven lip synchronization; cross-attention mechanism; feature fusion;
DOI
10.3390/electronics13183657
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Speech-driven lip synchronization is a crucial technology for generating realistic facial animations, with broad application prospects in virtual reality, education, training, and other fields. However, existing methods still face challenges in generating high-fidelity facial animations, particularly in addressing lip jitter and facial motion instability issues in continuous frame sequences. This study presents VividWav2Lip, an improved speech-driven lip synchronization model. Our model incorporates three key innovations: a cross-attention mechanism for enhanced audio-visual feature fusion, an optimized network structure with Squeeze-and-Excitation (SE) residual blocks, and the integration of the CodeFormer facial restoration network for post-processing. Extensive experiments were conducted on a diverse dataset comprising multiple languages and facial types. Quantitative evaluations demonstrate that VividWav2Lip outperforms the baseline Wav2Lip model by 5% in lip sync accuracy and image generation quality, with even more significant improvements over other mainstream methods. In subjective assessments, 85% of participants perceived VividWav2Lip-generated animations as more realistic compared to those produced by existing techniques. Additional experiments reveal our model's robust cross-lingual performance, maintaining consistent quality even for languages not included in the training set. This study not only advances the theoretical foundations of audio-driven lip synchronization but also offers a practical solution for high-fidelity, multilingual dynamic face generation, with potential applications spanning virtual assistants, video dubbing, and personalized content creation.
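The abstract names Squeeze-and-Excitation (SE) residual blocks as one of the model's optimizations. As a rough illustration of the SE idea only (the paper's exact placement of these blocks inside the Wav2Lip generator is not given here), the sketch below implements the standard squeeze/excitation/scale steps on a single feature map in NumPy; the weight matrices `w1` and `w2` and the reduction ratio are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation channel reweighting for a (C, H, W) feature map.

    Illustrative sketch only; w1 (C//r, C) and w2 (C, C//r) stand in for
    the learned excitation weights of an SE residual block.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid -> per-channel gate in (0, 1)
    s = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))
    # Scale: reweight each channel of the input by its gate
    return feature_map * s[:, None, None]
```

Because the gates lie in (0, 1), the block attenuates less informative channels while preserving the sign and shape of the input; in a residual variant, this reweighted output would be added back to the block's input.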
Pages: 19