TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Cited by: 3
Authors
Zhao, Zhengdao [1 ]
Wang, Yuhua [1 ]
Shen, Guang [1 ]
Xu, Yuezhu [1 ]
Zhang, Jiayuan [2 ]
Affiliations
[1] Harbin Engn Univ, High Performance Comp Res Ctr, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, High Performance Comp Lab, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion recognition; Feature extraction; Transformers; Correlation; Data models; Speech recognition; Computer architecture; Deep-scale fusion transformer; multimodal embedding; multimodal emotion recognition; mutual correlation; mutual transformer;
DOI
10.1109/TASLP.2023.3316458
CLC classification code
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
As deep learning research continues to progress, artificial intelligence is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state of speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition that addresses these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate data scarcity, providing the model with a priori knowledge of multimodal information drawn from a large amount of unlabeled data. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method, the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To evaluate the performance of TDFNet, experiments are conducted on the IEMOCAP corpus under three reasonable data-splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA under the RA data split, improvements of 1.78% WA and 1.17% UA, respectively, over the previous state-of-the-art method. Benefiting from attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
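The abstract describes the DST module as learning multiscale emotion features through GRUs with shared weights. The sketch below illustrates only that general idea, not the authors' implementation: it assumes average-pooling downsampling across temporal scales, and the class name SharedGRUMultiScale, feature size 128, and scales 1/2/4 are all made up for illustration.

```python
# Minimal sketch (not the paper's code): one GRU, with shared weights,
# applied to a fused feature sequence at several temporal scales.
import torch
import torch.nn as nn

class SharedGRUMultiScale(nn.Module):
    """Reuses a single GRU across downsampled copies of the input sequence."""
    def __init__(self, dim=128, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.gru = nn.GRU(dim, dim, batch_first=True)  # shared across scales

    def forward(self, x):  # x: (batch, time, dim)
        outputs = []
        for s in self.scales:
            # Assumed downsampling scheme: average pooling over the time axis.
            xs = nn.functional.avg_pool1d(x.transpose(1, 2), kernel_size=s).transpose(1, 2)
            _, h = self.gru(xs)           # h: (1, batch, dim), final hidden state
            outputs.append(h.squeeze(0))  # one summary vector per scale
        return torch.cat(outputs, dim=-1)  # (batch, dim * num_scales)

feats = torch.randn(8, 100, 128)           # e.g., 100 frames of fused features
print(SharedGRUMultiScale()(feats).shape)  # torch.Size([8, 384])
```

The point of the sketch is only that the same GRU parameters summarize the sequence at each scale; the paper's DST presumably combines this with transformer-based multimodal alignment, which is not shown here.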
Pages: 3771-3782
Number of pages: 12