Positional Label for Self-Supervised Vision Transformer

Cited by: 0
Authors
Zhang, Zhemin [1 ]
Gong, Xun [1 ,2 ,3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Sichuan, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transp, Beijing, Peoples R China
[3] Mfg Ind Chains Collaborat & Informat Support Tech, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Positional encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image, and its general effectiveness in ViT has been demonstrated. In this work, we propose training ViT to recognize the positional labels of the patches of the input image; this apparently simple task actually yields a meaningful self-supervisory signal. Building on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can easily be plugged into current ViT variants and work in two ways: (a) as an auxiliary training target for vanilla ViT to improve performance, and (b) combined with self-supervised ViT to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that with the proposed self-supervised methods, ViT-B and Swin-B gain improvements of 1.20% and 0.74% top-1 accuracy on ImageNet, respectively, and of 6.15% and 1.14% on Mini-ImageNet. The code is publicly available at: https://github.com/zhangzhemin/PositionalLabel.
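To make the task concrete, below is a minimal PyTorch sketch of the absolute-position variant, assuming a standard ViT backbone whose per-patch tokens (CLS token excluded) are available. The class name PositionalLabelHead, all shapes, and the loss formulation are illustrative assumptions, not taken from the paper or its released code; in particular, how the paper keeps positional embeddings from trivially leaking the answer (e.g., predicting from tokens computed without them, or from shuffled patches) is a detail this sketch does not settle.

import torch
import torch.nn as nn

class PositionalLabelHead(nn.Module):
    # Auxiliary head: classify each patch token into its absolute grid
    # index. A sketch of the idea in the abstract; the paper's exact
    # head architecture and loss weighting may differ.
    def __init__(self, embed_dim: int, num_patches: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_patches)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim)
        logits = self.classifier(patch_tokens)          # (B, N, N)
        batch, num_patches, _ = logits.shape
        # Self-supervised target: patch i carries positional label i.
        target = torch.arange(num_patches, device=logits.device)
        target = target.expand(batch, num_patches)      # (B, N)
        return nn.functional.cross_entropy(
            logits.reshape(-1, num_patches), target.reshape(-1))

if __name__ == "__main__":
    # e.g., ViT-B/16 on a 224x224 input: 14x14 = 196 patches, dim 768.
    tokens = torch.randn(2, 196, 768)
    head = PositionalLabelHead(embed_dim=768, num_patches=196)
    print(head(tokens).item())

In use (a), this auxiliary loss would be added, with a weight, to the ordinary classification loss of a vanilla ViT; in use (b), it would accompany a self-supervised objective. A relative-position variant would classify pairwise patch offsets instead of absolute indices.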
Pages: 3516-3524
Number of pages: 9
Related Papers
50 records in total
  • [31] Multi-scale vision transformer classification model with self-supervised learning and dilated convolution
    Xing, Liping
    Jin, Hongmei
    Li, Hong-an
    Li, Zhanli
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 103
  • [32] Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation
    Park, Soyeon
    Kim, Bo-Kyeong
    Dong, Suh-Yeon
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [33] DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer
    Kumar, Sonal
    Sur, Arijit
    Baruah, Rashmi Dutta
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2024, 16 (05) : 1775 - 1788
  • [34] Multimodal Image Fusion via Self-Supervised Transformer
    Zhang, Jing
    Liu, Yu
    Liu, Aiping
    Xie, Qingguo
    Ward, Rabab
    Wang, Z. Jane
    Chen, Xun
    IEEE SENSORS JOURNAL, 2023, 23 (09) : 9796 - 9807
  • [35] Self-supervised modal optimization transformer for image captioning
    Wang, Ye
    Li, Daitianxia
    Liu, Qun
    Liu, Li
    Wang, Guoyin
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (31) : 19863 - 19878
  • [36] Self-supervised Hypergraph Transformer with Alignment and Uniformity for Recommendation
    Yang, XianFeng
    Liu, Yang
    IAENG INTERNATIONAL JOURNAL OF COMPUTER SCIENCE, 2024, 51 (03) : 292 - 300
  • [37] Self-Supervised Pretraining Transformer for Seismic Data Denoising
    Wang, Hongzhou
    Lin, Jun
    Li, Yue
    Dong, Xintong
    Tong, Xunqian
    Lu, Shaoping
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 25
  • [38] MST: Masked Self-Supervised Transformer for Visual Representation
    Li, Zhaowen
    Chen, Zhiyang
    Yang, Fan
    Li, Wei
    Zhu, Yousong
    Zhao, Chaoyang
    Deng, Rui
    Wu, Liwei
    Zhao, Rui
    Tang, Ming
    Wang, Jinqiao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [39] Self-Supervised Image Aesthetic Assessment Based on Transformer
    Jia, Minrui
    Wang, Guangao
    Wang, Zibei
    Yang, Shuai
    Ke, Yongzhen
    Wang, Kai
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2025, 24 (01)
  • [40] Self-supervised graph transformer networks for social recommendation
    Li, Qinyao
    Yang, Qimeng
    Tian, Shengwei
    Yu, Long
    COMPUTERS & ELECTRICAL ENGINEERING, 2025, 123