Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Cited by: 4
Authors
Danecek, Radek [1]
Chhatre, Kiran [2]
Tripathi, Shashank [1]
Wen, Yandong [1]
Black, Michael [1]
Bolkart, Timo [1,3]
Affiliations
[1] Max Planck Institute for Intelligent Systems, Tübingen, Germany
[2] KTH Royal Institute of Technology, Stockholm, Sweden
[3] Google, Mountain View, CA 94043 USA
Keywords
Speech-driven Animation; Facial Animation; Computer Vision; Computer Graphics; Deep Learning; 3D Face Reconstruction
DOI
10.1145/3610548.3618183
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
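The abstract describes training with decoupled supervision: a per-frame lip-reading (content) loss, a sequence-level emotion loss, and a content-emotion exchange that pairs the speech of one clip with the emotion of another. The following is a minimal PyTorch sketch of that general idea only, not the authors' implementation; every module, encoder, and dimension below (ToyTalkingHead, lip_enc, emo_enc, the feature sizes) is a hypothetical placeholder rather than anything taken from the paper.

# Hypothetical sketch only; NOT the EMOTE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTalkingHead(nn.Module):
    """Maps per-frame audio features plus a clip-level emotion code to
    per-frame face parameters (stand-in for a FLAME-style decoder)."""
    def __init__(self, audio_dim=64, emo_dim=8, face_dim=50):
        super().__init__()
        self.gru = nn.GRU(audio_dim + emo_dim, 128, batch_first=True)
        self.out = nn.Linear(128, face_dim)

    def forward(self, audio_feats, emo_code):
        # audio_feats: (B, T, audio_dim); emo_code: (B, emo_dim)
        emo = emo_code.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        h, _ = self.gru(torch.cat([audio_feats, emo], dim=-1))
        return self.out(h)  # (B, T, face_dim)

def content_loss(pred_faces, gt_faces, lip_encoder):
    # Per-frame loss on lip features: speech content is spatially local
    # (mouth region) and has high temporal frequency.
    return F.mse_loss(lip_encoder(pred_faces), lip_encoder(gt_faces))

def emotion_loss(pred_faces, gt_faces, emo_encoder):
    # Sequence-level loss: pool emotion features over time before comparing,
    # since emotion deforms the whole face over longer intervals.
    return F.mse_loss(emo_encoder(pred_faces).mean(dim=1),
                      emo_encoder(gt_faces).mean(dim=1))

# Placeholder "perceptual" encoders; in the paper these roles are played by
# pretrained lip-reading and emotion-recognition networks.
lip_enc = nn.Linear(50, 32)
emo_enc = nn.Linear(50, 16)
model = ToyTalkingHead()

# Toy data: the same sentence recorded with emotion A (faces_a) and with
# emotion B (faces_b), as in a MEAD-style corpus.
audio = torch.randn(2, 30, 64)       # 2 clips, 30 frames of audio features
emo_b_code = torch.randn(2, 8)       # emotion code of clip B
faces_a = torch.randn(2, 30, 50)     # pseudo-GT face parameters, emotion A
faces_b = torch.randn(2, 30, 50)     # pseudo-GT face parameters, emotion B

# Content-emotion exchange: drive the model with the audio of clip A but the
# emotion code of clip B; lips are supervised per frame from clip A, emotion
# is supervised at the sequence level from clip B.
pred = model(audio, emo_b_code)
loss = content_loss(pred, faces_a, lip_enc) + emotion_loss(pred, faces_b, emo_enc)
loss.backward()

In this toy setup the two loss terms pull on different aspects of the same prediction: the per-frame term keeps the mouth synchronized with the driving audio, while the pooled term only constrains the clip-level emotional style, which is the separation of time scales the abstract motivates.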
Pages: 13