An effective multimodal representation and fusion method for multimodal intent recognition

Citations: 12
Authors
Huang, Xuejian [1 ,2 ]
Ma, Tinghuai [1 ]
Jia, Li [1 ]
Zhang, Yuanjian [3 ]
Rong, Huan [1 ]
Alnabhan, Najla [4 ]
Affiliations
[1] Nanjing Univ Informat Sci Technol, Sch Comp, Nanjing 210044, Jiangsu, Peoples R China
[2] Jiangxi Univ Finance & Econ, Sch VR Modern Ind, Nanchang 330013, Jiangxi, Peoples R China
[3] China UnionPay Co Ltd, Shanghai 201201, Peoples R China
[4] King Saud Univ, Sch Comp & Informat Sci, Riyadh, Saudi Arabia
Funding
National Natural Science Foundation of China;
Keywords
Multimodal intent recognition; Multimodal representation; Multimodal fusion; Attention mechanism; Gated neural network; CLASSIFICATION; TRANSFORMER; KNOWLEDGE;
DOI
10.1016/j.neucom.2023.126373
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Intent recognition is a crucial task in natural language understanding. Current research mainly focuses on task-specific unimodal intent recognition. However, in real-world scenes, human intentions are complex and must be judged by integrating information such as speech, tone, expression, and action. Therefore, this paper proposes an effective multimodal representation and fusion method (EMRFM) for intent recognition in real-world multimodal scenes. First, text, audio, and vision features are extracted with pre-trained BERT, Wav2vec 2.0, and Faster R-CNN models. Then, considering the complementarity and consistency among the modalities, modality-shared and modality-specific encoders are constructed to learn shared and specific feature representations of the modalities. Finally, an adaptive multimodal fusion method based on an attention-based gated neural network is designed to eliminate noise features. Comprehensive experiments are conducted on the multimodal intent recognition benchmark dataset MIntRec. The proposed model achieves higher accuracy, precision, recall, and F1-score than state-of-the-art multimodal learning methods. In multimodal sentiment recognition experiments on the CMU-MOSI dataset, the model again outperforms state-of-the-art methods. The experiments also demonstrate that the model's multimodal representation learns the shared and specific features of each modality well, and that its fusion mechanism adapts to the inputs and effectively reduces noise interference. © 2023 Elsevier B.V. All rights reserved.
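To make the pipeline in the abstract concrete, the following PyTorch sketch shows one plausible reading of the shared/specific encoding step and the attention-based gated fusion step. It is a minimal illustration, not the paper's implementation: the class name GatedFusionSketch, the hidden size, the feature dimensions, and the exact attention and gating formulas are all assumptions made here for clarity.

import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Illustrative shared/specific encoding + attention-gated fusion."""

    def __init__(self, dims, hid=128):
        super().__init__()
        # Per-modality projections align heterogeneous feature sizes
        # (e.g. BERT, Wav2vec 2.0, Faster R-CNN outputs) to one width.
        self.proj = nn.ModuleList([nn.Linear(d, hid) for d in dims])
        # One encoder shared by all modalities (captures consistency) ...
        self.shared = nn.Sequential(nn.Linear(hid, hid), nn.ReLU())
        # ... and one private encoder per modality (captures complementarity).
        self.specific = nn.ModuleList(
            [nn.Sequential(nn.Linear(hid, hid), nn.ReLU()) for _ in dims])
        self.attn = nn.Linear(2 * hid, 1)          # scores each modality
        self.gate = nn.Linear(2 * hid, 2 * hid)    # element-wise noise gate

    def forward(self, feats):  # feats: list of (batch, d_m) tensors
        reps = []
        for x, proj, spec in zip(feats, self.proj, self.specific):
            h = proj(x)
            # Concatenate shared and specific representations per modality.
            reps.append(torch.cat([self.shared(h), spec(h)], dim=-1))
        stacked = torch.stack(reps, dim=1)               # (batch, M, 2*hid)
        w = torch.softmax(self.attn(stacked), dim=1)     # attention over modalities
        fused = (w * stacked).sum(dim=1)                 # weighted pooling
        g = torch.sigmoid(self.gate(fused))              # suppress noisy dimensions
        return g * fused

# Hypothetical usage with typical extractor output sizes:
model = GatedFusionSketch(dims=[768, 512, 256])
text, audio, vision = (torch.randn(4, d) for d in (768, 512, 256))
fused = model([text, audio, vision])   # (4, 256) fused representation

The design point mirrored here is that attention weights decide how much each modality contributes, while the sigmoid gate then down-weights individual feature dimensions, which is one common way to realize the "eliminate noise features" behavior the abstract describes.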
Pages: 15
Related Papers (50 records in total; items 21-30 shown)
  • [21] Zeng, Fanguang. Multimodal music emotion recognition method based on multi data fusion. INTERNATIONAL JOURNAL OF ARTS AND TECHNOLOGY, 2023, 14(04): 271-282.
  • [22] Wang, Cheng; Yang, Haojin; Meinel, Christoph. Exploring Multimodal Video Representation for Action Recognition. 2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016: 1924-1931.
  • [23] Yang, Dingkang; Huang, Shuai; Kuang, Haopeng; Du, Yangtao; Zhang, Lihua. Disentangled Representation Learning for Multimodal Emotion Recognition. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 1642-1651.
  • [24] Chen, Yuzhao; Zhu, Wenhua; Yu, Weilun; Xue, Hongfei; Fu, Hao; Lin, Jiali; Jiang, Dazhi. Prompt Learning for Multimodal Intent Recognition with Modal Alignment Perception. COGNITIVE COMPUTATION, 2024, 16(06): 3417-3428.
  • [25] Huang, Lin; Yu, Chenxi; Cao, Xinzhe. Multimodal Biometric Person Recognition by Feature Fusion. 2018 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE 2018), 2018: 1158-1162.
  • [26] Ying, Yangwei; Yang, Tao; Zhou, Hong. Multimodal fusion for Alzheimer's disease recognition. APPLIED INTELLIGENCE, 2023, 53(12): 16029-16040.
  • [27] Xu, Yurui; Wu, Xiao; Su, Hang; Liu, Xiaorui. Multimodal Emotion Recognition Based on Feature Fusion. 2022 INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2022), 2022: 7-11.
  • [28] Huang, Jian; Tao, Jianhua; Liu, Bin; Lian, Zheng; Niu, Mingyue. Multimodal Transformer Fusion for Continuous Emotion Recognition. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020: 3507-3511.
  • [29] Tang, Shuyun; Luo, Zhaojie; Nan, Guoshun; Baba, Jun; Yoshikawa, Yuichiro; Ishiguro, Hiroshi. Fusion with Hierarchical Graphs for Multimodal Emotion Recognition. PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022: 1288-1296.
  • [30] Kindsvater, Daniel; Meudt, Sascha; Schwenker, Friedhelm. Fusion Architectures for Multimodal Cognitive Load Recognition. MULTIMODAL PATTERN RECOGNITION OF SOCIAL SIGNALS IN HUMAN-COMPUTER-INTERACTION, MPRSS 2016, 2017, 10183: 36-47.