Joint Learning of Facial Expression and Head Pose from Speech

Cited by: 16
Authors
Greenwood, David [1]
Matthews, Iain [1]
Laycock, Stephen [1]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich, Norfolk, England
Keywords
Speech Animation; Deep Learning; LSTM; BLSTM; RNN; Audiovisual Speech; Shape Modelling; Lip Sync; Uncanny Valley; Visual Prosody
DOI
10.21437/Interspeech.2018-2587
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution that visual cues make to how acceptable human observers find an animation. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that are difficult to predict, with considerable variation across visual modalities. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken a unique approach, but none has included rigid head pose in its predicted output. We observe a high degree of correspondence between facial activity and rigid head pose during speech, and exploit this observation to jointly learn full-face animation together with head pose rotation and translation. From our own corpus, we train deep bi-directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship between speech and the complex activity of the face. We define a model architecture that encourages learning of rigid head motion via the latent space of the speaker's facial activity. The result is a model that can predict lip sync and other facial motion, along with rigid head motion, directly from audible speech.
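The abstract describes the architecture only at a high level: audio features feed a deep bi-directional LSTM, facial parameters are decoded from its output, and rigid head pose is regressed from the facial latent space. As a rough illustration of that stated design, a minimal PyTorch sketch follows; the layer sizes, the audio feature choice, and the output dimensions are assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn as nn

    class SpeechToFacePose(nn.Module):
        """Hypothetical sketch of the joint model outlined in the abstract.
        All dimensions below are assumed, not reported in the paper."""
        def __init__(self, n_audio=26, hidden=256, n_face=30, n_pose=6):
            super().__init__()
            # Deep BLSTM over the audio feature sequence.
            self.blstm = nn.LSTM(n_audio, hidden, num_layers=3,
                                 bidirectional=True, batch_first=True)
            # Facial activity (e.g. shape-model parameters) per frame.
            self.face_head = nn.Linear(2 * hidden, n_face)
            # Rigid head pose (3 rotations + 3 translations) regressed from
            # the facial latent, mirroring the paper's stated design.
            self.pose_head = nn.Linear(n_face, n_pose)

        def forward(self, audio):              # audio: (batch, time, n_audio)
            h, _ = self.blstm(audio)           # (batch, time, 2 * hidden)
            face = self.face_head(h)           # (batch, time, n_face)
            pose = self.pose_head(face)        # (batch, time, n_pose)
            return face, pose

    # Usage: predict face and pose trajectories for 100 audio frames.
    model = SpeechToFacePose()
    face, pose = model(torch.randn(1, 100, 26))

Routing the pose prediction through the facial parameters, rather than through the raw BLSTM output, is one reading of "via the latent space of the speaker's facial activity"; training would supervise both outputs jointly.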
Pages: 2484-2488
Page count: 5
Related Papers
50 records in total
  • [31] Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
    Karras, Tero
    Aila, Timo
    Laine, Samuli
    Herva, Antti
    Lehtinen, Jaakko
    ACM TRANSACTIONS ON GRAPHICS, 2017, 36 (04):
  • [32] WNet: Joint Multiple Head Detection and Head Pose Estimation from a Spectator Crowd Image
    Jan, Yasir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Wong, Kok Wai
    COMPUTER VISION - ACCV 2018 WORKSHOPS, 2019, 11367 : 484 - 493
  • [33] Expressive facial animation synthesis by learning speech coarticulation and expression spaces
    Deng, Zhigang
    Neumann, Ulrich
    Lewis, J. P.
    Kim, Tae-Yong
    Bulut, Murtaza
    Narayanan, Shrikanth
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2006, 12 (06) : 1523 - 1534
  • [35] Head Pose Estimation and Movement Analysis for Speech Scene
    Komiya, Rinko
    Saitoh, Takeshi
    Fuyuno, Miharu
    Yamashita, Yuko
    Nakajima, Yoshitaka
    2016 IEEE/ACIS 15TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS), 2016, : 409 - 413
  • [36] Cross-pose Facial Expression Recognition
    Guney, Fatma
    Arar, Nuri Murat
    Fischer, Mika
    Ekenel, Hazim Kemal
    2013 10TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), 2013,
  • [37] Separability of pose and expression in facial tracking and animation
    Bascle, B
    Blake, A
    SIXTH INTERNATIONAL CONFERENCE ON COMPUTER VISION, 1998, : 323 - 328
  • [38] Disentangling Identity and Pose for Facial Expression Recognition
    Jiang, Jing
    Deng, Weihong
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (04) : 1868 - 1878
  • [39] POSE INVARIANT ROBUST FACIAL EXPRESSION ANALYSIS
    Win, Khin Thu Zar
    Chen, Fan
    Izawa, Junko
    Kotani, Kazunori
    2010 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, 2010, : 3837 - 3840
  • [40] Pose-Invariant Facial Expression Recognition
    Liang, Guang
    Wang, Shangfei
    Wang, Can
    2021 16TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2021), 2021,