Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition

Cited by: 64
Authors
Zhang, Shiqing [1 ,2 ]
Zhang, Shiliang [1 ]
Huang, Tiejun [1 ]
Gao, Wen [1 ]
Affiliations
[1] Peking Univ, Sch EE&CS, Beijing, Peoples R China
[2] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou, Peoples R China
Keywords
Emotion recognition; Multimodal deep learning; Deep convolutional neural network
DOI
10.1145/2911996.2912051
CLC number: TP301 [Theory and Methods]
Subject classification code: 081202
Abstract
Emotion recognition is a challenging task because of the emotional gap between subjective emotion and low-level audio-visual features. Inspired by the recent success of deep learning in bridging the semantic gap, this paper proposes to bridge the emotional gap with a multimodal Deep Convolutional Neural Network (DCNN), which fuses audio and visual cues in a single deep model. The multimodal DCNN is trained in two stages. First, two DCNN models pre-trained on large-scale image data are fine-tuned to perform audio and visual emotion recognition, respectively, on the corresponding labeled speech and face data. Second, the outputs of these two DCNNs are integrated in a fusion network built from a number of fully-connected layers. The fusion network is trained to obtain a joint audio-visual feature representation for emotion recognition. Experimental results on the RML audio-visual database demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues in a DCNN for emotion recognition, and its success warrants further research in this direction.
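The two-stage design described above ends in a fusion network: the feature vectors emitted by the fine-tuned audio and visual DCNNs are concatenated and passed through fully-connected layers to predict an emotion class. A minimal NumPy sketch of that fusion forward pass is given below; all concrete numbers (4096-d DCNN features, a 1024-unit hidden layer, six emotion classes as in RML) and the random weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical per-modality features from the two fine-tuned DCNNs
audio_feat = rng.standard_normal(4096)   # speech DCNN output
visual_feat = rng.standard_normal(4096)  # face DCNN output

# Fusion network: concatenation followed by fully-connected layers
fused = np.concatenate([audio_feat, visual_feat])  # 8192-d joint input
W1 = rng.standard_normal((1024, 8192)) * 0.01      # FC layer 1 weights
b1 = np.zeros(1024)
W2 = rng.standard_normal((6, 1024)) * 0.01         # FC layer 2: 6 emotion classes
b2 = np.zeros(6)

hidden = relu(W1 @ fused + b1)          # joint audio-visual representation
probs = softmax(W2 @ hidden + b2)       # class probabilities, sum to 1
print(probs.shape)
```

In training, the cross-entropy loss on `probs` would be back-propagated through the fusion layers (and optionally into the two DCNNs) to learn the joint representation.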
Pages: 281-284
Number of pages: 4