The Multimodal Scene Recognition Method Based on Self-Attention and Distillation

Authors
Sun, Ning [1]
Xu, Wei [1]
Liu, Jixin [1]
Chai, Lei [1]
Sun, Haian [1]
Affiliations
[1] Nanjing University of Posts and Telecommunications, Nanjing 210003, People's Republic of China
Keywords
Feature extraction; Training; Image recognition; Transformers; Layout; Convolutional neural networks; Sun; NETWORK;
DOI
10.1109/MMUL.2024.3415643
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Scene recognition is a challenging task in computer vision because of the diversity of objects in scene images and the ambiguity of object layouts. In recent years, the emergence of multimodal scene data has provided new solutions for scene recognition, but it has also brought new problems. To address these challenges, the self-attention and distillation-based multimodal scene recognition network (SAD-MSR) is proposed in this article. The backbone of the model adopts the pure transformer structure of self-attention, which can extract local and global spatial features of multimodal scene images. A multistage fusion mechanism was developed for this model in which the concatenated tokens of two modalities are fused based on self-attention in the early stage, while the high-level features extracted from the two modalities are fused based on cross attention in the late stage. Furthermore, a distillation mechanism is introduced to alleviate the problem of a limited number of training samples. Finally, we conducted extensive experiments on two multimodal scene recognition databases, SUN RGB-D and NYU Depth, to show the effectiveness of SAD-MSR. Compared with other state-of-the-art multimodal scene recognition methods, our method can achieve better experimental results.
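The two fusion stages and the distillation term named in the abstract can be illustrated with a minimal sketch. This is not the authors' SAD-MSR implementation: it assumes plain scaled dot-product attention with no learned query/key/value projections, a symmetric cross-attention layout, and a temperature-softened KL distillation loss, none of which the abstract specifies. All function names (`attention`, `early_fusion`, `late_fusion`, `distillation_loss`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Assumption: no learned projections, so Q, K, V are the raw tokens.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)


def early_fusion(rgb_tokens, depth_tokens):
    """Early stage: self-attention over the concatenated tokens of both modalities."""
    tokens = rgb_tokens + depth_tokens
    return attention(tokens, tokens, tokens)


def late_fusion(rgb_feats, depth_feats):
    """Late stage: cross attention, where each modality queries the other.

    Assumption: the two attended streams are simply concatenated.
    """
    rgb_attended = attention(rgb_feats, depth_feats, depth_feats)
    depth_attended = attention(depth_feats, rgb_feats, rgb_feats)
    return rgb_attended + depth_attended


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    Assumption: a standard soft-label distillation loss; the abstract does
    not state which distillation objective the paper uses.
    """
    p = softmax([t / temperature for t in teacher_logits])
    q = softmax([s / temperature for s in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy example: 3 RGB tokens and 3 depth tokens, each with 4 dimensions.
rgb = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.0]]
depth = [[0.0, 0.4, 0.2, 0.1], [0.2, 0.2, 0.2, 0.2], [0.6, 0.0, 0.1, 0.3]]

fused_early = early_fusion(rgb, depth)                      # 6 tokens x 4 dims
fused_late = late_fusion(fused_early[:3], fused_early[3:])  # 6 tokens x 4 dims
loss = distillation_loss([1.0, 0.5, -0.2], [1.2, 0.3, -0.1])
```

In this sketch the early stage lets every token attend to tokens of both modalities at once, while the late stage restricts each modality to attending only to the other one. A real model would add learned projections, multiple heads, and a classification head on top of the fused features.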
Pages: 25-36
Page count: 12