Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited by: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); Classification
DOI
10.1145/3242969.3264992
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
The difficulty of emotion recognition in the wild (EmotiW) lies in training a robust model that can handle diverse scenarios and anomalies. The Audio-Video Sub-challenge in EmotiW provides short audio-video clips annotated with emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by exploiting two mutually complementary sources: facial images and audio. The framework consists of two parts: a facial image model and an audio model. For the facial image model, three spatio-temporal neural network architectures are employed to extract discriminative features for different emotions from facial expression images. First, high-level spatial features are obtained with pre-trained convolutional neural networks (CNNs), namely VGG-Face and ResNet-50, each fed with the frame images extracted from every video. The features of all frames are then fed sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance texture within a video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images generated by preprocessing the speech audio are likewise modeled with a VGG-BLSTM framework to characterize affective fluctuations more effectively. Finally, a fusion strategy over the score matrices of the different spatio-temporal networks is proposed to exploit their complementarity and boost recognition performance. Extensive experiments show that the overall accuracy of the proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the champion team in 2017.
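The pipeline described in the abstract can be made concrete with a short sketch. The following is a minimal, hedged illustration rather than the authors' implementation, assuming PyTorch and torchvision: a pre-trained CNN backbone (a plain ResNet-50 stands in for the face-trained VGG-Face/ResNet-50 models), a BLSTM over per-frame features, and decision-level fusion of softmax score matrices from several branches. The hidden size, the seven-class output, and the equal fusion weights are illustrative assumptions.

# Hedged sketch (not the authors' code): one CNN-BLSTM branch plus
# decision-level score fusion, assuming PyTorch and torchvision.
import torch
import torch.nn as nn
from torchvision import models


class CnnBlstmBranch(nn.Module):
    """Per-frame CNN features -> bidirectional LSTM -> clip-level emotion scores.

    A generic ResNet-50 stands in for the pre-trained VGG-Face / ResNet-50
    backbones mentioned in the abstract; hidden size and the 7-class output
    are illustrative assumptions.
    """

    def __init__(self, num_classes: int = 7, hidden: int = 128):
        super().__init__()
        backbone = models.resnet50(weights=None)  # load face-trained weights in practice
        backbone.fc = nn.Identity()               # keep the 2048-d pooled feature
        self.backbone = backbone
        self.blstm = nn.LSTM(2048, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) aligned face crops from one clip
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        seq, _ = self.blstm(feats)                # temporal dynamics across frames
        return self.classifier(seq.mean(dim=1))   # clip-level emotion scores


def fuse_scores(score_list, weights=None):
    """Decision-level fusion: weighted sum of per-branch softmax score matrices."""
    probs = [torch.softmax(s, dim=-1) for s in score_list]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)  # equal weights as a simple default
    return sum(w * p for w, p in zip(weights, probs))

The audio branch would follow the same pattern with spectrogram images in place of face crops, and the 3D CNN branch would contribute a third score matrix to fuse_scores.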
Pages: 646-652
Page count: 7