Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Cited by: 4
Authors
Danecek, Radek [1]
Chhatre, Kiran [2]
Tripathi, Shashank [1]
Wen, Yandong [1]
Black, Michael [1]
Bolkart, Timo [1,3]
Affiliations
[1] Max Planck Institute for Intelligent Systems, Tübingen, Germany
[2] KTH Royal Institute of Technology, Stockholm, Sweden
[3] Google, Mountain View, CA 94043 USA
Keywords
Speech-Driven Animation; Facial Animation; Computer Vision; Computer Graphics; Deep Learning; 3D Face Reconstruction
DOI
10.1145/3610548.3618183
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
Pages: 13
Related Papers
50 items in total
  • [31] Speech-driven 3D Facial Animation for Mobile Entertainment
    Yan, Juan
    Xie, Xiang
    Hu, Hao
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-5, 2008: 2334-2337
  • [32] FaceFormer: Speech-Driven 3D Facial Animation with Transformers
    Fan, Yingruo
    Lin, Zhaojiang
    Saito, Jun
    Wang, Wenping
    Komura, Taku
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 18749-18758
  • [33] A Research on Facial Animation Driven by Emotional Speech
    Lixiang, Li
    ICCSSE 2009: PROCEEDINGS OF 2009 4TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, 2009: 118-121
  • [34] CLTalk: Speech-Driven 3D Facial Animation with Contrastive Learning
    Zhang, Xitie
    Wu, Suping
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024: 1175-1179
  • [35] Real-time speech-driven 3D face animation
    Hong, PY
    Wen, Z
    Huang, TS
    Shum, HY
    FIRST INTERNATIONAL SYMPOSIUM ON 3D DATA PROCESSING VISUALIZATION AND TRANSMISSION, 2002: 713-716
  • [36] Geometry-Guided Dense Perspective Network for Speech-Driven Facial Animation
    Liu, Jingying
    Hui, Binyuan
    Li, Kun
    Liu, Yunke
    Lai, Yu-Kun
    Zhang, Yuxiang
    Liu, Yebin
    Yang, Jingyu
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2022, 28(12): 4873-4886
  • [37] Text-driven Speech Animation with Emotion Control
    Chae, Wonseok
    Kim, Yejin
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2020, 14(8): 3473-3487
  • [38] Analyzing Visible Articulatory Movements in Speech Production for Speech-Driven 3D Facial Animation
    Kim, Hyung Kyu
    Lee, Sangmin
    Kim, Hak Gu
    PROCEEDINGS - INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2024: 3575-3579
  • [39] Speech-driven Embodied Entrainment Character System with Emotional Expressions and Motions by Speech Recognition
    Kohara, Mizuki
    Shikata, Hiraku
    Watanabe, Tomio
    Ishii, Yutaka
    2014 IEEE/SICE INTERNATIONAL SYMPOSIUM ON SYSTEM INTEGRATION (SII), 2014: 431-435
  • [40] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
    Xing, Jinbo
    Xia, Menghan
    Zhang, Yuechen
    Cun, Xiaodong
    Wang, Jue
    Wong, Tien-Tsin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023: 12780-12790