Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Cited by: 4
Authors
Daněček, Radek [1]
Chhatre, Kiran [2]
Tripathi, Shashank [1]
Wen, Yandong [1]
Black, Michael [1]
Bolkart, Timo [1,3]
Affiliations
[1] Max Planck Institute for Intelligent Systems, Tübingen, Germany
[2] KTH Royal Institute of Technology, Stockholm, Sweden
[3] Google, Mountain View, CA 94043, USA
Keywords
Speech-Driven Animation; Facial Animation; Computer Vision; Computer Graphics; Deep Learning; 3D Face Reconstruction
DOI
10.1145/3610548.3618183
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Because no high-quality emotional 3D face dataset aligned with speech exists, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
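The decoupled supervision and the content-emotion exchange are the core of the method, so a short sketch may help make them concrete. The following is a minimal illustration reconstructed from the abstract alone, not from the authors' released implementation: `lip_reader`, `emotion_net`, and the `model(audio, emotion=...)` call signature are hypothetical stand-ins for the perceptual networks and animation model the paper uses, and MSE is assumed as the feature distance.

```python
# Hedged sketch of EMOTE-style decoupled supervision, reconstructed from
# the abstract only. All names below are hypothetical stand-ins.
# `lip_reader` / `emotion_net`: pretrained perceptual networks mapping a
# (T, ...) sequence of face frames to (T, D) features.
import torch.nn.functional as F


def decoupled_losses(pred, gt, lip_reader, emotion_net):
    """Speech deformations are local and high-frequency, so lip-sync is
    supervised per frame; emotion deforms the whole face over longer
    intervals, so it is supervised once per sequence."""
    # Per-frame lip-reading loss preserves the speech-dependent content.
    lip_loss = F.mse_loss(lip_reader(pred), lip_reader(gt))
    # Sequence-level emotion loss: pool features over time, then compare.
    emotion_loss = F.mse_loss(
        emotion_net(pred).mean(dim=0), emotion_net(gt).mean(dim=0)
    )
    return lip_loss, emotion_loss


def exchange_losses(model, audio, swapped_emotion, gt_content, gt_emotion,
                    lip_reader, emotion_net):
    """Content-emotion exchange: animate the same audio under a different
    emotion label; tie lip supervision to the audio's own ground truth and
    emotion supervision to the swapped emotion's ground truth."""
    pred = model(audio, emotion=swapped_emotion)  # same speech, new emotion
    lip_loss = F.mse_loss(lip_reader(pred), lip_reader(gt_content))
    emotion_loss = F.mse_loss(
        emotion_net(pred).mean(dim=0), emotion_net(gt_emotion).mean(dim=0)
    )
    return lip_loss, emotion_loss
```

Pooling the emotion features over time before comparing them is what keeps the sequence-level emotion signal from competing with the high-frequency, per-frame lip constraint; in the exchange step, each supervision target is tied to the factor it is meant to control.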
Pages: 13