TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Cited by: 3
Authors
Zhao, Zhengdao [1 ]
Wang, Yuhua [1 ]
Shen, Guang [1 ]
Xu, Yuezhu [1 ]
Zhang, Jiayuan [2 ]
Affiliations
[1] Harbin Engn Univ, High Performance Comp Res Ctr, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, High Performance Comp Lab, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Emotion recognition; Feature extraction; Transformers; Correlation; Data models; Speech recognition; Computer architecture; Deep-scale fusion transformer; multimodal embedding; multimodal emotion recognition; mutual correlation; mutual transformer;
DOI
10.1109/TASLP.2023.3316458
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
As deep learning research continues to advance, artificial intelligence is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state conveyed in speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have driven progress in emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition that addresses these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate data scarcity, supplying the model with prior knowledge of multimodal information learned from large amounts of unlabeled data. A mutual transformer (MT) module is then introduced to learn multimodal emotional commonalities and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method, the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To comparatively evaluate the performance of TDFNet, experiments are conducted on the IEMOCAP corpus under three reasonable data splitting strategies. TDFNet achieves 82.08% weighted accuracy (WA) and 82.57% unweighted accuracy (UA) under the RA data split, improvements of 1.78% WA and 1.17% UA over the previous state-of-the-art method. Benefiting from attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
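To make the two mechanisms the abstract describes concrete, the following is a minimal, illustrative PyTorch sketch: a mutual-transformer block in which audio and text features attend to each other, and a deep-scale block that reuses a single shared-weight GRU over several temporal scales. All class names, dimensions, the average-pooling scheme, and the four-class output are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch of cross-modal (mutual) attention plus a
# shared-weight multiscale GRU, loosely following the abstract's
# description of TDFNet's MT and DST modules. Hyperparameters and
# structure are assumptions, not the published model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualTransformerBlock(nn.Module):
    """Each modality queries the other via multi-head attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, audio, text):
        # Audio frames attend to text tokens, and vice versa.
        a, _ = self.a2t(audio, text, text)
        t, _ = self.t2a(text, audio, audio)
        # Residual connections keep each modality's own features.
        return self.norm_a(audio + a), self.norm_t(text + t)


class DeepScaleGRU(nn.Module):
    """One GRU, with shared weights, applied at several temporal scales."""

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.gru = nn.GRU(dim, dim, batch_first=True)  # reused at every scale

    def forward(self, x):  # x: (batch, time, dim)
        summaries = []
        for s in self.scales:
            # Average-pool along time to build a coarser view of the sequence.
            xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)
            _, h = self.gru(xs.transpose(1, 2))  # h: (1, batch, dim)
            summaries.append(h.squeeze(0))
        return torch.cat(summaries, dim=-1)  # multiscale summary per sample


if __name__ == "__main__":
    audio = torch.randn(8, 100, 256)  # e.g., frame features from a pretrained speech encoder
    text = torch.randn(8, 40, 256)    # e.g., token features from a pretrained language model
    a, t = MutualTransformerBlock(256)(audio, text)
    fused = DeepScaleGRU(256)(torch.cat([a, t], dim=1))
    logits = nn.Linear(fused.size(-1), 4)(fused)  # 4 emotion classes, as in common IEMOCAP setups
    print(logits.shape)  # torch.Size([8, 4])
```

Sharing one GRU across scales keeps the parameter count fixed while the pooled views expose emotion cues at different temporal granularities, which is the intuition the abstract attributes to DST; the published model's actual alignment and fusion details are given in the article itself.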
Pages: 3771-3782
Page count: 12
Related Papers
50 records in total
  • [41] GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation
    Li, Jiang
    Wang, Xiaoping
    Lv, Guoqing
    Zeng, Zhigang
    NEUROCOMPUTING, 2023, 550
  • [42] TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION
    Hu, Chuanfei
    Sheng, Weijie
    Dong, Bo
    Li, Xinde
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3229 - 3233
  • [43] Deep Feature Extraction and Attention Fusion for Multimodal Emotion Recognition
    Yang, Zhiyi
    Li, Dahua
    Hou, Fazheng
    Song, Yu
    Gao, Qiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2024, 71 (03) : 1526 - 1530
  • [44] Topics Guided Multimodal Fusion Network for Conversational Emotion Recognition
    Yuan, Peicong
    Cai, Guoyong
    Chen, Ming
    Tang, Xiaolv
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14877 : 250 - 262
  • [45] A Transformer-Based Network for Dynamic Hand Gesture Recognition
    D'Eusanio, Andrea
    Simoni, Alessandro
    Pini, Stefano
    Borghi, Guido
    Vezzani, Roberto
    Cucchiara, Rita
    2020 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2020), 2020, : 623 - 632
  • [46] MM-NodeFormer: Node Transformer Multimodal Fusion for Emotion Recognition in Conversation
    Huang, Zilong
    Mak, Man-Wai
    Lee, Kong Aik
    INTERSPEECH 2024, 2024, : 4069 - 4073
  • [47] Residual multimodal Transformer for expression-EEG fusion continuous emotion recognition
    Jin, Xiaofang
    Xiao, Jieyu
    Jin, Libiao
    Zhang, Xinruo
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2024, 9 (05) : 1290 - 1304
  • [48] A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
    Delbrouck, Jean-Benoit
    Tits, Noe
    Brousmiche, Mathilde
    Dupont, Stephane
    PROCEEDINGS OF THE SECOND GRAND CHALLENGE AND WORKSHOP ON MULTIMODAL LANGUAGE (CHALLENGE-HML), VOL 1, 2020, : 1 - 7
  • [49] Emotion Recognition Based on Feedback Weighted Fusion of Multimodal Emotion Data
    Wei, Wei
    Jia, Qingxuan
    Feng, Yongli
    2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (IEEE ROBIO 2017), 2017, : 1682 - 1687
  • [50] A spatial and temporal transformer-based EEG emotion recognition in VR environment
    Li, Ming
    Yu, Peng
    Shen, Yang
    FRONTIERS IN HUMAN NEUROSCIENCE, 2025, 19