Spatio-Temporal Encoder-Decoder Fully Convolutional Network for Video-Based Dimensional Emotion Recognition

Cited by: 19
Authors
Du, Zhengyin [1 ]
Wu, Suowei [2 ]
Huang, Di [1 ]
Li, Weixin [3 ]
Wang, Yunhong [3 ]
Affiliations
[1] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
[2] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Sino French Engineer Sch, Beijing 100191, Peoples R China
[3] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Emotion recognition; Convolution; Decoding; Feature extraction; Videos; Visualization; Task analysis; Dimensional emotion recognition; spatio-temporal fully convolutional network; temporal hourglass CNN; temporal intermediate supervision; EXPRESSION RECOGNITION;
DOI
10.1109/TAFFC.2019.2940224
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video-based dimensional emotion recognition aims to map human affect into a dimensional emotion space from visual signals, a fundamental challenge in affective computing and human-computer interaction. In this paper, we present a novel encoder-decoder framework to tackle this problem. It adopts a fully convolutional design, cascading a 2D-convolution-based spatial encoder with a 1D-convolution-based temporal encoder-decoder for joint spatio-temporal modeling. In particular, to address the key issue of capturing discriminative long-term dynamic dependencies, our temporal model, referred to as the Temporal Hourglass Convolutional Neural Network (TH-CNN), captures contextual relationships by integrating both low-level encoded and high-level decoded cues. Temporal Intermediate Supervision (TIS) is then introduced to enhance the affective representations generated by TH-CNN under a multi-resolution strategy, guiding TH-CNN to progressively learn the macroscopic long-term trend and refined short-term fluctuations. Furthermore, thanks to TH-CNN and TIS, the knowledge learnt by the intermediate layers makes it possible to offer customized solutions for different applications by adjusting the decoder depth. Extensive experiments conducted on three benchmark databases (RECOLA, SEWA and OMG) show superior results compared to state-of-the-art methods, indicating the effectiveness of the proposed approach.
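Below is a minimal PyTorch sketch of the cascaded design described in the abstract: a 2D-convolution spatial encoder applied to each frame, followed by a 1D-convolution temporal hourglass whose bottleneck carries a coarse auxiliary head standing in for the intermediate supervision. The module names (SpatialEncoder, TemporalHourglass), layer counts, channel widths, and the single auxiliary head are illustrative assumptions, not the authors' exact TH-CNN/TIS configuration.

```python
import torch
import torch.nn as nn


class SpatialEncoder(nn.Module):
    """2D-convolution encoder applied to every frame independently (sketch)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, 64, 1, 1)
        return self.proj(feats.flatten(1)).view(b, t, -1)  # (B, T, D)


class TemporalHourglass(nn.Module):
    """1D-convolution temporal encoder-decoder with skip connections and a
    coarse auxiliary head for (assumed) intermediate supervision."""
    def __init__(self, dim=128):
        super().__init__()
        self.down1 = nn.Conv1d(dim, dim, 3, stride=2, padding=1)          # T   -> T/2
        self.down2 = nn.Conv1d(dim, dim, 3, stride=2, padding=1)          # T/2 -> T/4
        self.up1 = nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1)   # T/4 -> T/2
        self.up2 = nn.ConvTranspose1d(dim, dim, 4, stride=2, padding=1)   # T/2 -> T
        self.head_coarse = nn.Conv1d(dim, 1, 1)   # low-resolution prediction at the bottleneck
        self.head_final = nn.Conv1d(dim, 1, 1)    # frame-level dimensional emotion prediction

    def forward(self, feats):                      # feats: (B, T, D), T divisible by 4
        x = feats.transpose(1, 2)                  # (B, D, T)
        e1 = torch.relu(self.down1(x))
        e2 = torch.relu(self.down2(e1))
        coarse = self.head_coarse(e2).squeeze(1)   # target for intermediate supervision
        d1 = torch.relu(self.up1(e2)) + e1         # fuse decoded with encoded clues (skip)
        d2 = torch.relu(self.up2(d1))
        return self.head_final(d2).squeeze(1), coarse


# Example: 64 frames of 64x64 crops per clip.
frames = torch.randn(2, 64, 3, 64, 64)
pred, coarse = TemporalHourglass()(SpatialEncoder()(frames))
print(pred.shape, coarse.shape)                    # (2, 64) and (2, 16)
```

Supervising both heads against the annotation trace (downsampled for the coarse head) would realize the multi-resolution supervision idea; varying the number of down/up stages corresponds to adjusting the decoder depth mentioned above.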
Pages: 565-578
Page count: 14
Related Papers (50 total)
  • [1] Predicting 360° Video Saliency: A ConvLSTM Encoder-Decoder Network With Spatio-Temporal Consistency
    Wan, Zhaolin; Qin, Han; Xiong, Ruiqin; Li, Zhiyang; Fan, Xiaopeng; Zhao, Debin
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2024, 14(02): 311-322
  • [2] A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction
    Li, Hao; Qi, Fei; Shi, Guangming
    IEEE ACCESS, 2021, 9: 36328-36341
  • [3] Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild
    Lu, Cheng; Zheng, Wenming; Li, Chaolong; Tang, Chuangao; Liu, Suyuan; Yan, Simeng; Zong, Yuan
    ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018: 646-652
  • [4] Video-based Emotion Recognition using Aggregated Features and Spatio-temporal Information
    Xu, Jinchang; Dong, Yuan; Ma, Lilei; Bai, Hongliang
    2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2018: 2833-2838
  • [5] Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
    Chen, Jingwen; Pan, Yingwei; Li, Yehao; Yao, Ting; Chao, Hongyang; Mei, Tao
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019: 8167-8174
  • [6] Object Contour Detection with a Fully Convolutional Encoder-Decoder Network
    Yang, Jimei; Price, Brian; Cohen, Scott; Lee, Honglak; Yang, Ming-Hsuan
    2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 193-202
  • [7] Microseismic Signal Denoising and Separation Based on Fully Convolutional Encoder-Decoder Network
    Zhang, Hang; Ma, Chunchi; Pazzi, Veronica; Zou, Yulin; Casagli, Nicola
    APPLIED SCIENCES-BASEL, 2020, 10(18)
  • [8] Spatio-temporal keypoints for video-based face recognition
    Franco, A.; Maio, D.; Turroni, F.
    2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2014: 489-494
  • [9] Denoising Raman spectra using fully convolutional encoder-decoder network
    Loc, Irem; Kecoglu, Ibrahim; Unlu, Mehmet Burcin; Parlatan, Ugur
    JOURNAL OF RAMAN SPECTROSCOPY, 2022, 53(08): 1445-1452
  • [10] RED-Net: A Recurrent Encoder-Decoder Network for Video-Based Face Alignment
    Peng, Xi; Feris, Rogerio S.; Wang, Xiaoyu; Metaxas, Dimitris N.
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2018, 126(10): 1103-1119