Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition

Cited by: 64
Authors
Zhang, Shiqing [1 ,2 ]
Zhang, Shiliang [1 ]
Huang, Tiejun [1 ]
Gao, Wen [1 ]
Affiliations
[1] Peking Univ, Sch EE&CS, Beijing, Peoples R China
[2] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou, Peoples R China
Keywords
Emotion recognition; Multimodal deep learning; Deep convolutional neural network
DOI
10.1145/2911996.2912051
CLC Number
TP301 [Theory, Methods]
Subject Classification Code
081202
Abstract
Emotion recognition is a challenging task because of the emotional gap between subjective emotion and low-level audio-visual features. Inspired by the recent success of deep learning in bridging the semantic gap, this paper proposes to bridge the emotional gap with a multimodal Deep Convolutional Neural Network (DCNN), which fuses audio and visual cues in a deep model. This multimodal DCNN is trained in two stages. First, two DCNN models pre-trained on large-scale image data are fine-tuned to perform audio and visual emotion recognition, respectively, on the corresponding labeled speech and face data. Second, the outputs of these two DCNNs are integrated in a fusion network built from a number of fully-connected layers. The fusion network is trained to obtain a joint audio-visual feature representation for emotion recognition. Experimental results on the RML audio-visual database demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues in a DCNN for emotion recognition, and its success warrants further research in this direction.
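The fusion stage described in the abstract can be sketched as follows. This is a minimal illustrative forward pass, not the authors' implementation: the feature dimensions, hidden size, and six-emotion output are assumptions (the abstract does not specify them), and the fine-tuned DCNN outputs are stood in for by random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes; the paper does not report the actual dimensions.
AUDIO_DIM, VISUAL_DIM, HIDDEN, NUM_EMOTIONS = 128, 128, 64, 6

# Stage 1 stand-ins: outputs of the two fine-tuned DCNNs for one sample.
audio_feat = rng.standard_normal(AUDIO_DIM)
visual_feat = rng.standard_normal(VISUAL_DIM)

# Stage 2: fusion network of fully-connected layers over the concatenation,
# producing the joint audio-visual representation and emotion probabilities.
W1 = rng.standard_normal((AUDIO_DIM + VISUAL_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, NUM_EMOTIONS)) * 0.01
b2 = np.zeros(NUM_EMOTIONS)

joint = np.concatenate([audio_feat, visual_feat])  # joint audio-visual feature
hidden = relu(joint @ W1 + b1)                     # fully-connected layer
probs = softmax(hidden @ W2 + b2)                  # emotion class probabilities

print(probs.shape)  # (6,)
```

In the actual method both DCNNs and the fusion layers would be trained jointly with backpropagation on labeled audio-visual data; the sketch only shows how the two modality-specific outputs are combined at the fusion stage.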
Pages: 281-284
Page count: 4
Related Papers
50 records in total
  • [21] Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities
    Middya, Asif Iqbal
    Nag, Baibhav
    Roy, Sarbani
    KNOWLEDGE-BASED SYSTEMS, 2022, 244
  • [23] RECURRENT NEURAL NETWORK TRANSDUCER FOR AUDIO-VISUAL SPEECH RECOGNITION
    Makino, Takaki
    Liao, Hank
    Assael, Yannis
    Shillingford, Brendan
    Garcia, Basilio
    Braga, Otavio
    Siohan, Olivier
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 905 - 912
  • [24] Deep Audio-Visual Speech Recognition
    Afouras, Triantafyllos
    Chung, Joon Son
    Senior, Andrew
    Vinyals, Oriol
    Zisserman, Andrew
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (12) : 8717 - 8727
  • [25] Multimodal Emotion Recognition Based on Ensemble Convolutional Neural Network
    Huang, Haiping
    Hu, Zhenchao
    Wang, Wenming
    Wu, Min
    IEEE ACCESS, 2020, 8 : 3265 - 3271
  • [26] AUDIO-VISUAL KEYWORD SPOTTING BASED ON MULTIDIMENSIONAL CONVOLUTIONAL NEURAL NETWORK
    Ding, Runwei
    Pang, Cheng
    Liu, Hong
    2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 4138 - 4142
  • [27] DEEP AUDIO-VISUAL FUSION NEURAL NETWORK FOR SALIENCY ESTIMATION
    Yao, Shunyu
    Min, Xiongkuo
    Zhai, Guangtao
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 1604 - 1608
  • [28] Multimodal Emotion Recognition and State Analysis of Classroom Video and Audio Based on Deep Neural Network
    Li, Mingyong
    Liu, Mingyue
    Jiang, Zheng
    Zhao, Zongwei
    Zhang, Jiayan
    Ge, Mingyuan
    Duan, Huiming
    Wang, Yanxia
    JOURNAL OF INTERCONNECTION NETWORKS, 2022, 22 (SUPP04)
  • [29] Combining audio and visual speech recognition using LSTM and deep convolutional neural network
    Shashidhar R.
    Patilkulkarni S.
    Puneeth S.B.
    International Journal of Information Technology, 2022, 14 (7) : 3425 - 3436
  • [30] Audio-Visual Deep Neural Network for Robust Person Verification
    Qian, Yanmin
    Chen, Zhengyang
    Wang, Shuai
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1079 - 1092