An effective multimodal representation and fusion method for multimodal intent recognition

Citations: 12
Authors
Huang, Xuejian [1 ,2 ]
Ma, Tinghuai [1 ]
Jia, Li [1 ]
Zhang, Yuanjian [3 ]
Rong, Huan [1 ]
Alnabhan, Najla [4 ]
Affiliations
[1] Nanjing Univ Informat Sci Technol, Sch Comp, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Univ Finance & Econ, Sch VR Modern Ind, Nanchang 330013, Jiangxi, Peoples R China
[3] China UnionPay Co Ltd, Shanghai 201201, Peoples R China
[4] King Saud Univ, Sch Comp & Informat Sci, Riyadh, Saudi Arabia
Funding
National Natural Science Foundation of China;
Keywords
Multimodal intent recognition; Multimodal representation; Multimodal fusion; Attention mechanism; Gated neural network; CLASSIFICATION; TRANSFORMER; KNOWLEDGE;
DOI
10.1016/j.neucom.2023.126373
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Intent recognition is a crucial task in natural language understanding. Current research mainly focuses on task-specific unimodal intent recognition. However, in real-world scenes, human intentions are complex and must be judged by integrating information such as speech, tone, expression, and action. Therefore, this paper proposes an effective multimodal representation and fusion method (EMRFM) for intent recognition in real-world multimodal scenes. First, text, audio, and vision features are extracted with pre-trained BERT, Wav2vec 2.0, and Faster R-CNN models. Then, considering the complementarity and consistency among the modalities, modality-shared and modality-specific encoders are constructed to learn shared and specific feature representations of the modalities. Finally, an adaptive multimodal fusion method based on an attention-based gated neural network is designed to eliminate noise features. Comprehensive experiments are conducted on the multimodal intent recognition benchmark dataset MIntRec. The proposed model achieves higher accuracy, precision, recall, and F1-score than state-of-the-art multimodal learning methods. In multimodal sentiment recognition experiments on the CMU-MOSI dataset, the model again outperforms state-of-the-art methods. The experiments also demonstrate that the model's multimodal representation learns the shared and specific features of each modality well, and that its fusion mechanism adapts to the inputs and effectively reduces noise interference. © 2023 Elsevier B.V. All rights reserved.
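To make the pipeline in the abstract concrete, the following PyTorch sketch shows one plausible reading of the shared/specific encoding step and the attention-based gated fusion step. It is a minimal illustration, not the paper's implementation: the class name GatedFusionSketch, the hidden size, the feature dimensions, and the exact attention and gating formulas are all assumptions made here for clarity.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Illustrative shared/specific encoding + attention-gated fusion."""

    def __init__(self, dims, hid=128):
        super().__init__()
        # Per-modality projections align heterogeneous feature sizes
        # (e.g. BERT, Wav2vec 2.0, Faster R-CNN outputs) to one width.
        self.proj = nn.ModuleList([nn.Linear(d, hid) for d in dims])
        # One encoder shared by all modalities (captures consistency) ...
        self.shared = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())
        # ... and one private encoder per modality (captures complementarity).
        self.specific = nn.ModuleList(
            [nn.Sequential(nn.Linear(hid, hid), nn.ReLU()) for _ in dims])
        self.attn = nn.Linear(2 * hid, 1)          # scores each modality
        self.gate = nn.Linear(2 * hid, 2 * hid)    # element-wise noise gate

    def forward(self, feats):  # feats: list of (batch, d_m) tensors
        reps = []
        for x, proj, spec in zip(feats, self.proj, self.specific):
            h = proj(x)
            # Concatenate shared and specific representations per modality.
            reps.append(torch.cat([self.shared(h), spec(h)], dim=-1))
        stacked = torch.stack(reps, dim=1)               # (batch, M, 2*hid)
        w = torch.softmax(self.attn(stacked), dim=1)     # attention over modalities
        fused = (w * stacked).sum(dim=1)                 # weighted pooling
        g = torch.sigmoid(self.gate(fused))              # suppress noisy dimensions
        return g * fused

# Hypothetical usage with typical extractor output sizes:
model = GatedFusionSketch(dims=[768, 512, 256])
text, audio, vision = (torch.randn(4, d) for d in (768, 512, 256))
fused = model([text, audio, vision])   # (4, 256) fused representation

The design point mirrored here is that attention weights decide how much each modality contributes, while the sigmoid gate then down-weights individual feature dimensions, which is one common way to realize the "eliminate noise features" behavior the abstract describes.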
Pages: 15
Related Papers (50 records in total; items 21-30 shown)
  • [21] Zeng, Fanguang. Multimodal music emotion recognition method based on multi data fusion. INTERNATIONAL JOURNAL OF ARTS AND TECHNOLOGY, 2023, 14(04): 271-282.
  • [22] Wang, Cheng; Yang, Haojin; Meinel, Christoph. Exploring Multimodal Video Representation for Action Recognition. 2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016: 1924-1931.
  • [23] Yang, Dingkang; Huang, Shuai; Kuang, Haopeng; Du, Yangtao; Zhang, Lihua. Disentangled Representation Learning for Multimodal Emotion Recognition. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 1642-1651.
  • [24] Chen, Yuzhao; Zhu, Wenhua; Yu, Weilun; Xue, Hongfei; Fu, Hao; Lin, Jiali; Jiang, Dazhi. Prompt Learning for Multimodal Intent Recognition with Modal Alignment Perception. COGNITIVE COMPUTATION, 2024, 16(06): 3417-3428.
  • [25] Huang, Lin; Yu, Chenxi; Cao, Xinzhe. Multimodal Biometric Person Recognition by Feature Fusion. 2018 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE 2018), 2018: 1158-1162.
  • [26] Ying, Yangwei; Yang, Tao; Zhou, Hong. Multimodal fusion for Alzheimer's disease recognition. APPLIED INTELLIGENCE, 2023, 53(12): 16029-16040.
  • [27] Xu, Yurui; Wu, Xiao; Su, Hang; Liu, Xiaorui. Multimodal Emotion Recognition Based on Feature Fusion. 2022 INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2022), 2022: 7-11.
  • [28] Huang, Jian; Tao, Jianhua; Liu, Bin; Lian, Zheng; Niu, Mingyue. Multimodal Transformer Fusion for Continuous Emotion Recognition. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020: 3507-3511.
  • [29] Tang, Shuyun; Luo, Zhaojie; Nan, Guoshun; Baba, Jun; Yoshikawa, Yuichiro; Ishiguro, Hiroshi. Fusion with Hierarchical Graphs for Multimodal Emotion Recognition. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022: 1288-1296.
  • [30] Kindsvater, Daniel; Meudt, Sascha; Schwenker, Friedhelm. Fusion Architectures for Multimodal Cognitive Load Recognition. MULTIMODAL PATTERN RECOGNITION OF SOCIAL SIGNALS IN HUMAN-COMPUTER-INTERACTION, MPRSS 2016, 2017, 10183: 36-47.