Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Cited by: 4
Authors
Danecek, Radek [1 ]
Chhatre, Kiran [2 ]
Tripathi, Shashank [1 ]
Wen, Yandong [1 ]
Black, Michael [1 ]
Bolkart, Timo [1 ,3 ]
Affiliations
[1] MPI for Intelligent Systems, Tübingen, Germany
[2] KTH Royal Institute of Technology, Stockholm, Sweden
[3] Google, Mountain View, CA 94043 USA
Keywords
Speech-Driven Animation; Facial Animation; Computer Vision; Computer Graphics; Deep Learning; 3D Face Reconstruction
DOI
10.1145/3610548.3618183
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
Pages: 13
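
The decoupled supervision described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not EMOTE's actual implementation: `lip_reader` and `emotion_net` are hypothetical stand-ins for the perceptual lip-reading and emotion-recognition networks, and the weights `w_lip` and `w_emo` are placeholder hyperparameters.

import torch.nn.functional as F

def disentangled_losses(pred_verts, gt_verts, lip_reader, emotion_net,
                        w_lip=1.0, w_emo=0.5):
    """Sketch of decoupled speech/emotion supervision (hypothetical API).

    pred_verts, gt_verts: (T, V, 3) predicted and pseudo-ground-truth
        face meshes for a sequence of T frames.
    lip_reader:  maps each frame's mouth region to lip-reading features
        (per-frame supervision, since speech-driven deformation is
        localized around the mouth and has high temporal frequency).
    emotion_net: maps a whole sequence to an emotion embedding
        (sequence-level supervision, since expressions span the full
        face and unfold over longer intervals).
    """
    # (1) Content: match lip-reading features frame by frame.
    lip_loss = F.mse_loss(lip_reader(pred_verts), lip_reader(gt_verts))

    # (2) Emotion: match emotion embeddings of the entire sequence
    #     (batch dimension added so the sequence net sees one clip).
    emo_loss = F.mse_loss(emotion_net(pred_verts.unsqueeze(0)),
                          emotion_net(gt_verts.unsqueeze(0)))

    return w_lip * lip_loss + w_emo * emo_loss

One plausible realization of the content-emotion exchange mechanism follows directly from this split: when emotion is swapped between two sequences, the lip loss can be computed against the content source and the emotion loss against the emotion source, so different emotions are supervised on the same audio without breaking lip-sync. The temporal VAE motion prior mentioned in the abstract then keeps outputs on a manifold of plausible facial motion while these perceptual losses are optimized.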