Joint Learning of Facial Expression and Head Pose from Speech

Cited by: 16
Authors
Greenwood, David [1]
Matthews, Iain [1]
Laycock, Stephen [1]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich, Norfolk, England
Keywords
Speech Animation; Deep Learning; LSTM; BLSTM; RNN; Audiovisual Speech; Shape Modelling; Lip Sync; Uncanny Valley; Visual Prosody
DOI
10.21437/Interspeech.2018-2587
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution that visual cues make to how acceptable human observers find an animation. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that are difficult to predict, with considerable variation across visual modalities. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken a unique approach, but none has included rigid head pose in its predicted output. We observe a high degree of correspondence between facial activity and rigid head pose during speech, and exploit this observation to jointly learn full-face animation together with head pose rotation and translation. From our own corpus, we train deep bi-directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship between speech and the complex activity of the face. We define a model architecture that encourages learning of rigid head motion via the latent space of the speaker's facial activity. The result is a model that can predict lip sync and other facial motion, along with rigid head motion, directly from audible speech.
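The abstract describes the architecture only at a high level: audio features feed a deep bi-directional LSTM, facial parameters are decoded from its output, and rigid head pose is regressed from the facial latent space. As a rough illustration of that stated design, a minimal PyTorch sketch follows; the layer sizes, the audio feature choice, and the output dimensions are assumptions for illustration, not details taken from the paper.

    import torch
    import torch.nn as nn

    class SpeechToFacePose(nn.Module):
        """Hypothetical sketch of the joint model outlined in the abstract.
        All dimensions below are assumed, not reported in the paper."""
        def __init__(self, n_audio=26, hidden=256, n_face=30, n_pose=6):
            super().__init__()
            # Deep BLSTM over the audio feature sequence.
            self.blstm = nn.LSTM(n_audio, hidden, num_layers=3,
                                 bidirectional=True, batch_first=True)
            # Facial activity (e.g. shape-model parameters) per frame.
            self.face_head = nn.Linear(2 * hidden, n_face)
            # Rigid head pose (3 rotations + 3 translations) regressed from
            # the facial latent, mirroring the paper's stated design.
            self.pose_head = nn.Linear(n_face, n_pose)

        def forward(self, audio):              # audio: (batch, time, n_audio)
            h, _ = self.blstm(audio)           # (batch, time, 2 * hidden)
            face = self.face_head(h)           # (batch, time, n_face)
            pose = self.pose_head(face)        # (batch, time, n_pose)
            return face, pose

    # Usage: predict face and pose trajectories for 100 audio frames.
    model = SpeechToFacePose()
    face, pose = model(torch.randn(1, 100, 26))

Routing the pose prediction through the facial parameters, rather than through the raw BLSTM output, is one reading of "via the latent space of the speaker's facial activity"; training would supervise both outputs jointly.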
Pages: 2484-2488
Page count: 5
Related Papers
50 records in total
  • [31] Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
    Karras, Tero
    Aila, Timo
    Laine, Samuli
    Herva, Antti
    Lehtinen, Jaakko
    ACM TRANSACTIONS ON GRAPHICS, 2017, 36 (04):
  • [32] WNet: Joint Multiple Head Detection and Head Pose Estimation from a Spectator Crowd Image
    Jan, Yasir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Wong, Kok Wai
    COMPUTER VISION - ACCV 2018 WORKSHOPS, 2019, 11367 : 484 - 493
  • [33] Expressive facial animation synthesis by learning speech coarticulation and expression spaces
    Deng, Zhigang
    Neumann, Ulrich
    Lewis, J. P.
    Kim, Tae-Yong
    Bulut, Murtaza
    Narayanan, Shrikanth
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2006, 12 (06) : 1523 - 1534
  • [35] Head Pose Estimation and Movement Analysis for Speech Scene
    Komiya, Rinko
    Saitoh, Takeshi
    Fuyuno, Miharu
    Yamashita, Yuko
    Nakajima, Yoshitaka
    2016 IEEE/ACIS 15TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS), 2016, : 409 - 413
  • [36] Cross-pose Facial Expression Recognition
    Guney, Fatma
    Arar, Nuri Murat
    Fischer, Mika
    Ekenel, Hazim Kemal
    2013 10TH IEEE INTERNATIONAL CONFERENCE AND WORKSHOPS ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG), 2013,
  • [37] Separability of pose and expression in facial tracking and animation
    Bascle, B
    Blake, A
    SIXTH INTERNATIONAL CONFERENCE ON COMPUTER VISION, 1998, : 323 - 328
  • [38] Disentangling Identity and Pose for Facial Expression Recognition
    Jiang, Jing
    Deng, Weihong
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (04) : 1868 - 1878
  • [39] POSE INVARIANT ROBUST FACIAL EXPRESSION ANALYSIS
    Win, Khin Thu Zar
    Chen, Fan
    Izawa, Junko
    Kotani, Kazunori
    2010 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, 2010, : 3837 - 3840
  • [40] Pose-Invariant Facial Expression Recognition
    Liang, Guang
    Wang, Shangfei
    Wang, Can
    2021 16TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FG 2021), 2021,