Positional Label for Self-Supervised Vision Transformer

Cited by: 0
Authors
Zhang, Zhemin [1 ]
Gong, Xun [1 ,2 ,3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Sichuan, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transp, Beijing, Peoples R China
[3] Mfg Ind Chains Collaborat & Informat Support Tech, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Positional encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image, and its general effectiveness in ViT has been demonstrated. In this work, we propose to train ViT to recognize the positional labels of the patches of the input image; this apparently simple task yields a meaningful self-supervisory signal. Building on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can be easily plugged into various current ViT variants and can work in two ways: (a) as an auxiliary training target for vanilla ViT to improve performance, or (b) combined with self-supervised ViT to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that, with the proposed self-supervised methods, ViT-B and Swin-B gain top-1 accuracy improvements of 1.20% and 0.74% on ImageNet, respectively, and 6.15% and 1.14% on Mini-ImageNet. The code is publicly available at: https://github.com/zhangzhemin/PositionalLabel.
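To make the abstract's idea of an auxiliary positional-label target concrete, the following is a minimal sketch of an absolute-position variant: each patch token is classified into one of the patch-grid positions alongside the usual image-label loss. The timm ViT-B/16 backbone, the head design, and the loss weight lambda_pos are illustrative assumptions, not the authors' released implementation; see the repository linked above for the actual code.

```python
# Sketch only: auxiliary "absolute positional label" objective for a ViT.
# Assumptions (not from the paper's code): timm backbone, linear position head,
# fixed loss weight. In practice the task is only informative when the token
# being classified does not trivially encode its position.
import torch
import torch.nn as nn
import timm


class ViTWithPositionalLabel(nn.Module):
    def __init__(self, lambda_pos: float = 0.1):
        super().__init__()
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=False)
        num_patches = self.backbone.patch_embed.num_patches  # 196 for 224px / 16px patches
        embed_dim = self.backbone.embed_dim
        # Auxiliary head: predict which of the num_patches grid positions a token occupies.
        self.pos_head = nn.Linear(embed_dim, num_patches)
        self.lambda_pos = lambda_pos

    def forward(self, images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone.forward_features(images)         # (B, 1 + N, D)
        cls_token, patch_tokens = tokens[:, 0], tokens[:, 1:]   # split CLS / patch tokens
        cls_loss = nn.functional.cross_entropy(self.backbone.head(cls_token), labels)

        # Positional-label targets: patch i should be classified as position i.
        b, n, _ = patch_tokens.shape
        pos_targets = torch.arange(n, device=images.device).expand(b, n)
        pos_logits = self.pos_head(patch_tokens)                # (B, N, N)
        pos_loss = nn.functional.cross_entropy(
            pos_logits.reshape(b * n, n), pos_targets.reshape(b * n)
        )
        return cls_loss + self.lambda_pos * pos_loss
```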
Pages: 3516-3524
Number of pages: 9
Related Papers
50 items in total
  • [41] Unsupervised object discovery with pseudo label generated using K-means and self-supervised transformer
    Lim, SeongTaek
    Park, JaeEon
    Lee, MinYoung
    Lee, HongChul
    NEUROCOMPUTING, 2023, 545
  • [42] Transformer-based correlation mining network with self-supervised label generation for multimodal sentiment analysis
    Wang, Ruiqing
    Yang, Qimeng
    Tian, Shengwei
    Yu, Long
    He, Xiaoyu
    Wang, Bo
    NEUROCOMPUTING, 2025, 618
  • [43] Self-supervised Label Augmentation via Input Transformations
    Lee, Hankook
    Hwang, Sung Ju
    Shin, Jinwoo
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 119, 2020, 119
  • [44] Self-supervised knowledge distillation for complementary label learning
    Liu, Jiabin
    Li, Biao
    Lei, Minglong
    Shi, Yong
    NEURAL NETWORKS, 2022, 155 : 318 - 327
  • [45] Self-supervised vision transformer-based few-shot learning for facial expression recognition
    Chen, Xuanchi
    Zheng, Xiangwei
    Sun, Kai
    Liu, Weilong
    Zhang, Yuang
    INFORMATION SCIENCES, 2023, 634 : 206 - 226
  • [46] Identification of Dental Lesions Using Self-Supervised Vision Transformer in Radiographic X-ray Images
    Li Y.
    Zhao H.
    Yang D.
    Du S.
    Cui X.
    Zhang J.
    Computer-Aided Design and Applications, 2024, 21 (S23): 332 - 342
  • [47] ST-VTON: Self-supervised vision transformer for image-based virtual try-on
    Chong, Zheng
    Mo, Lingfei
    IMAGE AND VISION COMPUTING, 2022, 127
  • [48] SERE: Exploring Feature Self-Relation for Self-Supervised Transformer
    Li, Zhong-Yu
    Gao, Shanghua
    Cheng, Ming-Ming
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (12) : 15619 - 15631
  • [49] Self-Supervised Domain Adaptation for Computer Vision Tasks
    Xu, Jiaolong
    Xiao, Liang
    Lopez, Antonio M.
    IEEE ACCESS, 2019, 7 : 156694 - 156706
  • [50] Self-supervised learning in cooperative stereo vision correspondence
    Decoux, B
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 1997, 8 (01) : 101 - 111