Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited by: 50

Authors:

Lu, Cheng [1]

Zheng, Wenming [2]

Li, Chaolong [3]

Tang, Chuangao [3]

Liu, Suyuan [3]

Yan, Simeng [3]

Zong, Yuan [3]

Affiliations:

[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China

[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China

[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China

Source:

ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2018

Funding:

National Natural Science Foundation of China

Keywords:

Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); CLASSIFICATION;

DOI:

10.1145/3242969.3264992

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The main difficulty of emotion recognition in the wild (EmotiW) is training a robust model that can deal with diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by using two mutually complementary sources: facial images and audio. The framework consists of two parts: the facial image model and the audio model. In the facial image model, three different spatio-temporal neural network architectures are employed to extract discriminative features of different emotions from facial expression images. First, high-level spatial features are obtained from pre-trained convolutional neural networks (CNN), namely VGG-Face and ResNet-50, both of which are fed with the images generated from each video. Then, the features of all frames are fed sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture dynamic variations of facial appearance textures in a video. In addition to the CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) obtained by extending the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. For the audio model, spectrogram images of speech, generated by preprocessing the audio, are also modeled in a VGG-BLSTM framework to characterize affective fluctuation more efficiently. Finally, a fusion strategy that uses the score matrices of the different spatio-temporal networks in the above framework is proposed to boost emotion recognition performance by exploiting their complementarity. Extensive experiments show that the overall accuracy of the proposed MSFF is 60.64%, which is a large improvement over the baseline and outperforms the result of the champion team in 2017.
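
The abstract states that the final prediction comes from fusing the score matrices of the individual networks, but this record does not give the fusion rule. As a rough illustration only, the sketch below assumes a weighted sum of per-model score matrices (clips by classes) followed by an argmax. The model names, the three-class toy scores, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

ScoreMatrix = List[List[float]]  # rows: clips, columns: emotion classes


def fuse_scores(score_matrices: Dict[str, ScoreMatrix],
                weights: Optional[Dict[str, float]] = None) -> ScoreMatrix:
    """Weighted sum of per-model score matrices (assumed fusion rule)."""
    names = list(score_matrices)
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = {name: 1.0 / len(names) for name in names}
    n_clips = len(score_matrices[names[0]])
    n_classes = len(score_matrices[names[0]][0])
    fused = [[0.0] * n_classes for _ in range(n_clips)]
    for name in names:
        matrix = score_matrices[name]
        if len(matrix) != n_clips or any(len(row) != n_classes for row in matrix):
            raise ValueError(f"score matrix of {name!r} has a mismatched shape")
        for i, row in enumerate(matrix):
            for j, score in enumerate(row):
                fused[i][j] += weights[name] * score
    return fused


def predict(fused: ScoreMatrix) -> List[int]:
    """Index of the highest fused score for each clip."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Hypothetical scores for two clips and three classes from four models.
scores = {
    "vgg_face_blstm": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "resnet50_blstm": [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]],
    "cnn3d": [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    "audio_vgg_blstm": [[0.4, 0.4, 0.2], [0.2, 0.2, 0.6]],
}

equal_weight_labels = predict(fuse_scores(scores))
```

In practice the weights would presumably be tuned on a validation set; passing a `weights` dictionary to `fuse_scores` changes how much each network contributes to the fused decision.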


Pages: 646-652

Page count: 7

