Token-Selective Vision Transformer for fine-grained image recognition of marine organisms

Cited by: 8
Authors
Si, Guangzhe [1 ]
Xiao, Ying [2 ]
Wei, Bin [3 ]
Bullock, Leon Bevan [4 ]
Wang, Yueyue [5 ]
Wang, Xiaodong [4 ]
Affiliations
[1] Ocean Univ China, Coll Elect Engn, Qingdao, Shandong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Sch Sci, Hong Kong, Peoples R China
[3] Qingdao Univ, Affiliated Hosp, Shandong Key Lab Digital Med & Comp Assisted Surg, Qingdao, Shandong, Peoples R China
[4] Ocean Univ China, Coll Comp Sci & Technol, Qingdao, Shandong, Peoples R China
[5] Ocean Univ China, Comp Ctr, Qingdao, Shandong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
token-selective; self-attention; vision transformer; fine-grained image classification; marine organisms;
DOI
10.3389/fmars.2023.1174347
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science];
Subject classification codes
08; 0830;
Abstract
Introduction: The objective of fine-grained image classification of marine organisms is to distinguish subtle variations among organisms so as to classify them accurately into subcategories. The key to accurate classification is locating the distinguishing feature regions, such as a fish's eye, fins, or tail. Images of marine organisms are difficult to work with: they are often taken from multiple angles and in different scenes, they usually have complex backgrounds, and they often contain humans or other distractions, all of which makes it hard to focus on the marine organism itself and identify its most distinctive features.
Related work: Most existing fine-grained image classification methods based on Convolutional Neural Networks (CNNs) cannot locate the distinguishing feature regions accurately enough, and the regions they identify also contain a large amount of background. The Vision Transformer (ViT) has a strong ability to capture global information and performs well on standard classification tasks. The core of ViT is the Multi-Head Self-Attention (MSA) mechanism, which first establishes connections between the different patch tokens of an image and then combines the information from all tokens for classification.
Methods: However, not all tokens are conducive to fine-grained classification; many contain extraneous data (noise). We aim to eliminate the influence of interfering tokens, such as background, on the identification of marine organisms, and then gradually narrow the local feature area to pinpoint the distinctive features. To this end, this paper puts forward a novel Transformer-based framework, the Token-Selective Vision Transformer (TSVT), in which the proposed Token-Selective Self-Attention (TSSA) selects the discriminative, important tokens for attention computation, which helps confine attention to more precise local regions. TSSA is applied at different layers, and the number of selected tokens in each layer decreases relative to the previous layer, so the method locates the distinguishing regions gradually, in a hierarchical manner.
Results: The effectiveness of TSVT is verified on three marine-organism datasets, and TSVT is shown to achieve state-of-the-art performance.
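The following is a minimal, illustrative sketch of the token-selection idea described in the abstract. It is written in PyTorch; the single-head attention, the CLS-based ranking of patch tokens, and the keep-ratio schedule are assumptions made for illustration, not the authors' published TSSA implementation.

```python
import torch
import torch.nn as nn

class TokenSelectiveAttention(nn.Module):
    """Single-head self-attention that keeps only the most attended patch
    tokens, ranked by the class (CLS) token's attention weights."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.keep_ratio = keep_ratio
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (B, 1 + N, dim), token 0 is CLS
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)              # (B, T, T)
        out = self.proj(attn @ v)                # ordinary attention output

        # Rank patch tokens by how strongly the CLS token attends to them,
        # then pass only the top-k of them to the next, more focused layer.
        cls_attn = attn[:, 0, 1:]                # (B, N): CLS -> patch attention
        k_keep = max(1, int(cls_attn.shape[1] * self.keep_ratio))
        idx = cls_attn.topk(k_keep, dim=-1).indices + 1   # +1 skips the CLS slot
        idx = idx.unsqueeze(-1).expand(-1, -1, D)         # (B, k_keep, D)
        kept = torch.gather(out, 1, idx)                  # selected patch tokens
        return torch.cat([out[:, :1], kept], dim=1)       # CLS + selected tokens

# Stacking such layers with a shrinking keep ratio narrows the attended
# region hierarchically, as the abstract describes (ratios are made up here).
x = torch.randn(2, 1 + 196, 64)                  # e.g. 14 x 14 patches + CLS
for ratio in (0.75, 0.5, 0.25):
    x = TokenSelectiveAttention(64, keep_ratio=ratio)(x)
print(x.shape)                                   # token count shrinks: 197 -> 148 -> 74 -> 19
```

In a full ViT encoder the selection would typically be averaged over attention heads and interleaved with the usual MLP blocks and residual connections; the sketch above only shows the selection step itself.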
Pages: 11