Multi-modal Knowledge-Enhanced Fine-Grained Image Classification

Authors
Cheng, Suyan [1]
Zhang, Feifei [1]
Zhou, Haoliang [2]
Xu, Changsheng [3]
Affiliations
[1] Tianjin Univ Technol, Tianjin, Peoples R China
[2] Jiangsu Univ Sci & Technol, Zhenjiang, Jiangsu, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Fine-grained image classification; Vision transformer; Scene text;
DOI
10.1007/978-981-97-8620-6_23
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In image classification tasks, visual appearance is generally considered a crucial cue for understanding images. However, relying solely on visual information can lead to misclassification in fine-grained image classification tasks. Multi-modal knowledge has been shown to provide critical cues for various computer vision tasks, such as image retrieval and visual question answering. In this paper, we integrate multi-modal knowledge into visual features to enhance the model's understanding of visual content and accomplish fine-grained image classification tasks. Specifically, we adopt an effective visual enhanced module to capture global and local features, obtaining discriminative visual representations. Meanwhile, we employ knowledge distillation to transfer multi-modal knowledge from the Contrastive Language-Image Pretraining (CLIP) model to our model, improving its generalization ability. Moreover, we incorporate scene text into our visual features to provide richer contextual information. Experiments on the Con-Text, Drink Bottle, and Crowd Activity benchmark datasets demonstrate that our approach achieves mAP improvements of 5.41%, 1.2%, and 7.55%, respectively, over the current state-of-the-art methods.
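The abstract mentions knowledge distillation from CLIP but does not state the loss that is used. As a rough illustration only, the sketch below shows a generic temperature-scaled distillation objective (a hard-label cross-entropy term plus a KL term between softened teacher and student distributions). The function names, the `temperature` and `alpha` parameters, and the loss form itself are assumptions for illustration, not the authors' formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical generic distillation loss (an assumption, not the paper's).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    softened teacher-to-student KL term, scaled by temperature squared.
    """
    hard = -math.log(softmax(student_logits)[label] + 1e-12)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

# Example: a student that disagrees with the teacher is penalised by the soft term.
loss = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], label=2)
```

In this sketch the teacher logits would stand in for scores derived from a CLIP-like model, and the soft term vanishes when student and teacher produce identical distributions.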
Pages: 333-346
Page count: 14