Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited by: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Institutions
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); CLASSIFICATION;
DOI
10.1145/3242969.3264992
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The central difficulty of emotion recognition in the wild (EmotiW) is training a robust model that copes with diverse scenarios and anomalies. The Audio-Video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. For better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by exploiting two mutually complementary sources: the facial images and the audio. The framework consists of two parts: a facial image model and an audio model. For the facial image model, three different spatio-temporal neural network architectures are employed to extract features that discriminate among emotions in facial expression images. First, high-level spatial features are obtained from pre-trained convolutional neural networks (CNNs), VGG-Face and ResNet-50, each fed with the frames extracted from a video. The features of all frames are then fed sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variation of facial appearance textures across the video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images of speech, generated by preprocessing the audio, are likewise modeled with a VGG-BLSTM framework to characterize affective fluctuations more effectively. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to boost emotion recognition performance in a complementary manner.
Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the 2017 champion team.
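The score-level fusion described in the abstract can be sketched in a few lines. This is a minimal illustration with hypothetical score values and equal weights; the model names in the comments follow the abstract, but the matrices and any weighting the authors actually used are not reproduced here.

```python
# Illustrative score matrices standing in for the outputs of three
# spatio-temporal models (e.g. VGG-Face+BLSTM, ResNet-50+BLSTM, 3D CNN).
# For brevity: 2 video clips x 3 emotion classes; values are placeholders,
# not scores from the paper.
scores_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
scores_b = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
scores_c = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]

def fuse_scores(score_mats, weights=None):
    """Weighted average of per-model score matrices, then per-clip argmax."""
    if weights is None:  # default: equal weight per model
        weights = [1.0 / len(score_mats)] * len(score_mats)
    n_clips, n_classes = len(score_mats[0]), len(score_mats[0][0])
    fused = [[sum(w * m[i][j] for w, m in zip(weights, score_mats))
              for j in range(n_classes)] for i in range(n_clips)]
    preds = [max(range(n_classes), key=lambda j: row[j]) for row in fused]
    return fused, preds

fused, preds = fuse_scores([scores_a, scores_b, scores_c])
print(preds)  # -> [0, 2]: fused scores flip clip 1's label relative to model A
```

The complementary effect the abstract refers to is visible even in this toy case: model A alone would label the second clip as class 1, but averaging in the other two score matrices shifts the decision to class 2.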
Pages: 646-652
Page count: 7
Related Papers
50 records in total
  • [41] Liu, Kan; Ma, Bingpeng; Zhang, Wei; Huang, Rui. A Spatio-Temporal Appearance Representation for Video-based Pedestrian Re-identification. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 3810-3818
  • [42] Huang, Lei; Luo, Bin. Video-based salient object detection via spatio-temporal difference and coherence. Multimedia Tools and Applications, 2018, 77(09): 10685-10699
  • [43] Saleh, Khaled; Mihaita, Adriana-Simona; Yu, Kun; Chen, Fang. Real-time Attention-Augmented Spatio-Temporal Networks for Video-based Driver Activity Recognition. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 2022: 1579-1585
  • [44] Bi, Xiaodong; He, Xiaohai; Xiong, Shuhua; Zhao, Zeming; Chen, Honggang; Sheriff, Raymond Edward. Blind video quality assessment based on Spatio-Temporal Feature Resolver. Neurocomputing, 2024, 574
  • [45] Cao, Jiangtao; Wang, Peiyao; Chen, Shuqi; Ji, Xiaofei. Two-Person Interaction Recognition Based on Video Sparse Representation and Improved Spatio-Temporal Feature. Intelligent Robotics and Applications, ICIRA 2019, Pt V, 2019, 11744: 473-488
  • [46] Zhong, Yuanhong; Chen, Xia; Hu, Yongting; Tang, Panliang; Ren, Fan. Bidirectional Spatio-Temporal Feature Learning With Multiscale Evaluation for Video Anomaly Detection. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(12): 8285-8296
  • [47] Nguyen, Dung; Nguyen, Kien; Sridharan, Sridha; Ghasemi, Afsane; Dean, David; Fookes, Clinton. Deep spatio-temporal features for multimodal emotion recognition. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV 2017), 2017: 1215-1223
  • [48] Wang, Jinwei; Zhao, Ziping; Liang, Jinglian; Li, Chao. Video-Based Emotion Recognition using Face Frontalization and Deep Spatiotemporal Feature. 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), 2018
  • [49] Ma, Chao; Gu, Yun; Liu, Wei; Yang, Jie; He, Xiangjian. Unsupervised Video Hashing by Exploiting Spatio-Temporal Feature. Neural Information Processing, ICONIP 2016, Pt III, 2016, 9949: 511-518
  • [50] Sakaino, Hidetomo. Spatio-Temporal Feature Extraction/Recognition in Videos Based on Energy Optimization. IEEE Transactions on Image Processing, 2019, 28(07): 3395-3407