TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Cited by: 3
Authors
Zhao, Zhengdao [1 ]
Wang, Yuhua [1 ]
Shen, Guang [1 ]
Xu, Yuezhu [1 ]
Zhang, Jiayuan [2 ]
Affiliations
[1] Harbin Engn Univ, High Performance Comp Res Ctr, Harbin 150001, Peoples R China
[2] Harbin Engn Univ, High Performance Comp Lab, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion recognition; Feature extraction; Transformers; Correlation; Data models; Speech recognition; Computer architecture; Deep-scale fusion transformer; multimodal embedding; multimodal emotion recognition; mutual correlation; mutual transformer;
DOI
10.1109/TASLP.2023.3316458
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
As deep learning research continues to progress, artificial intelligence is gradually empowering a wide range of fields. To achieve a more natural human-computer interaction experience, accurately recognizing the emotional state conveyed in speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning have advanced emotion recognition, but mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition that addresses these problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate data scarcity, providing the model with prior knowledge of multimodal information learned from large amounts of unlabeled data. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features, improving contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To evaluate the performance of TDFNet, experiments are conducted on the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA under the RA data split, improvements of 1.78% WA and 1.17% UA over the previous state-of-the-art method, respectively. Benefiting from the attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet achieves significant improvements in multimodal emotion recognition.
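The abstract describes the DST as learning multiscale emotion features "through GRUs with shared weights." The sketch below illustrates that general idea only: a single GRU cell (one weight set) is run over the same fused feature sequence at several temporal strides, and the resulting hidden states are concatenated into one multiscale feature. All names (`GRUCell`, `deep_scale_features`), the stride-based downsampling, and the toy dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell; the one weight set is reused across all scales
    (the 'shared weights' idea). Illustrative, not the paper's code."""
    def __init__(self, d_in, d_h):
        s = 1.0 / np.sqrt(d_h)
        self.Wz = rng.uniform(-s, s, (d_in + d_h, d_h))  # update gate
        self.Wr = rng.uniform(-s, s, (d_in + d_h, d_h))  # reset gate
        self.Wh = rng.uniform(-s, s, (d_in + d_h, d_h))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)
        r = sigmoid(xh @ self.Wr)
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1 - z) * h + z * h_tilde

    def run(self, seq):
        h = np.zeros(self.Wz.shape[1])
        for x in seq:
            h = self.step(x, h)
        return h

def deep_scale_features(seq, cell, scales=(1, 2, 4)):
    """Run the SAME GRU cell over the sequence at several temporal
    scales (stride-based downsampling here) and concatenate the final
    hidden states into one multiscale feature vector."""
    return np.concatenate([cell.run(seq[::s]) for s in scales])

# Toy fused multimodal sequence: 12 frames of 16-dim features.
seq = rng.standard_normal((12, 16))
cell = GRUCell(d_in=16, d_h=8)
feat = deep_scale_features(seq, cell)
print(feat.shape)  # (24,) = 8 hidden units x 3 scales
```

The shared cell forces every temporal scale through the same learned dynamics, so the concatenated vector differs only in how coarsely each copy sees the sequence; how TDFNet actually aligns modalities before this step is described in the full paper.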
Pages: 3771 - 3782
Page count: 12