Recombining Vision Transformer Architecture for Fine-Grained Visual Categorization

Cited by: 1
Authors
Deng, Xuran [1]
Liu, Chuanbin [1]
Lu, Zhiying [1]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230026, Peoples R China
Funding
China Postdoctoral Science Foundation
Keywords
Fine-Grained Visual Categorization; Vision Transformer
DOI
10.1007/978-3-031-27818-1_11
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Fine-grained visual categorization (FGVC) is a challenging image analysis task that requires comprehensive extraction and representation of discriminative features. To address it, previous works focus on designing complex modules, the so-called necks and heads, on top of simple backbones, which brings a heavy computational burden. In this paper, we offer a new insight: Vision Transformer (ViT) itself is an all-in-one FGVC framework, consisting of a basic Backbone for feature extraction, a Neck for further feature enhancement, and a Head for selecting discriminative features. We delve into the feature extraction and representation patterns of ViT for FGVC and empirically show that simply recombining the original ViT structure to leverage multi-level semantic representations, without introducing any additional parameters, achieves higher performance. Based on this insight, we propose RecViT, a simple recombination and modification of the original ViT that captures multi-level semantic features and facilitates fine-grained recognition. In RecViT, the deep layers of the original ViT serve as the Head, a few middle layers as the Neck, and the shallow layers as the Backbone. In addition, we adopt an optional Feature Processing Module to enhance the discriminative feature representation at each semantic level and align the levels for final recognition. With these simple modifications, RecViT obtains significant accuracy improvements on three FGVC benchmarks: CUB-200-2011, Stanford Cars and Stanford Dogs.
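The Backbone/Neck/Head recombination described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `split_layers` and `forward_multilevel`, the choice to keep one feature per Neck block plus the final Head output, and the layer counts in the usage note are hypothetical and are not taken from the paper. Plain callables stand in for transformer blocks, and the optional Feature Processing Module is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Stand-in for a transformer block (assumption: any callable x -> x).
Layer = Callable[[float], float]


def split_layers(depth: int, n_neck: int, n_head: int) -> Tuple[range, range, range]:
    """Partition block indices 0..depth-1 into (backbone, neck, head).

    Shallow blocks form the backbone, the next n_neck blocks the neck,
    and the last n_head blocks the head. No new blocks are created.
    """
    if n_neck < 0 or n_head < 1 or n_neck + n_head >= depth:
        raise ValueError("need at least one backbone block and one head block")
    n_backbone = depth - n_neck - n_head
    return (
        range(0, n_backbone),
        range(n_backbone, n_backbone + n_neck),
        range(n_backbone + n_neck, depth),
    )


def forward_multilevel(
    x: float, layers: Sequence[Layer], n_neck: int, n_head: int
) -> List[float]:
    """Run the blocks in order and collect one feature per semantic level.

    Assumption: each neck block output and the final head output count
    as one level. The paper may select or combine levels differently.
    """
    backbone, neck, head = split_layers(len(layers), n_neck, n_head)
    for i in backbone:
        x = layers[i](x)
    levels: List[float] = []
    for i in neck:
        x = layers[i](x)
        levels.append(x)  # intermediate (mid-level) feature
    for i in head:
        x = layers[i](x)
    levels.append(x)  # final (deep) feature
    return levels
```

As a usage example under the same assumptions, a 12-block model with 3 neck blocks and 1 head block gives `split_layers(12, 3, 1)` equal to `(range(0, 8), range(8, 11), range(11, 12))`, so four feature levels reach the final recognition step without adding parameters.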
Pages: 127-138
Page count: 12