Positional Label for Self-Supervised Vision Transformer

Cited by: 0
Authors
Zhang, Zhemin [1 ]
Gong, Xun [1 ,2 ,3 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Comp & Artificial Intelligence, Chengdu, Sichuan, Peoples R China
[2] Minist Educ, Engn Res Ctr Sustainable Urban Intelligent Transp, Beijing, Peoples R China
[3] Mfg Ind Chains Collaborat & Informat Support Tech, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Positional encoding is important for the vision transformer (ViT) to capture the spatial structure of the input image, and its general effectiveness in ViT has been demonstrated. In this work, we propose training ViT to recognize the positional labels of the patches of the input image; this apparently simple task actually yields a meaningful self-supervisory signal. Building on previous work on ViT positional encoding, we propose two positional labels dedicated to 2D images: absolute position and relative position. Our positional labels can easily be plugged into current ViT variants and work in two ways: (a) as an auxiliary training target for vanilla ViT to improve performance, and (b) combined with self-supervised ViT to provide a more powerful self-supervised signal for semantic feature learning. Experiments demonstrate that with the proposed self-supervised methods, ViT-B and Swin-B gain improvements of 1.20% and 0.74% top-1 accuracy on ImageNet, respectively, and of 6.15% and 1.14% on Mini-ImageNet. The code is publicly available at: https://github.com/zhangzhemin/PositionalLabel.
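To make the task concrete, below is a minimal PyTorch sketch of the absolute-position variant, assuming a standard ViT backbone whose per-patch tokens (CLS token excluded) are available. The class name PositionalLabelHead, all shapes, and the loss formulation are illustrative assumptions, not taken from the paper or its released code; in particular, how the paper keeps positional embeddings from trivially leaking the answer (e.g., predicting from tokens computed without them, or from shuffled patches) is a detail this sketch does not settle.

import torch
import torch.nn as nn

class PositionalLabelHead(nn.Module):
    # Auxiliary head: classify each patch token into its absolute grid
    # index. A sketch of the idea in the abstract; the paper's exact
    # head architecture and loss weighting may differ.
    def __init__(self, embed_dim: int, num_patches: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_patches)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim)
        logits = self.classifier(patch_tokens)          # (B, N, N)
        batch, num_patches, _ = logits.shape
        # Self-supervised target: patch i carries positional label i.
        target = torch.arange(num_patches, device=logits.device)
        target = target.expand(batch, num_patches)      # (B, N)
        return nn.functional.cross_entropy(
            logits.reshape(-1, num_patches), target.reshape(-1))

if __name__ == "__main__":
    # e.g., ViT-B/16 on a 224x224 input: 14x14 = 196 patches, dim 768.
    tokens = torch.randn(2, 196, 768)
    head = PositionalLabelHead(embed_dim=768, num_patches=196)
    print(head(tokens).item())

In use (a), this auxiliary loss would be added, with a weight, to the ordinary classification loss of a vanilla ViT; in use (b), it would accompany a self-supervised objective. A relative-position variant would classify pairwise patch offsets instead of absolute indices.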
Pages: 3516-3524
Number of pages: 9
Related Papers
50 records in total
  • [31] Multi-scale vision transformer classification model with self-supervised learning and dilated convolution
    Xing, Liping
    Jin, Hongmei
    Li, Hong-an
    Li, Zhanli
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 103
  • [32] Self-Supervised RGB-NIR Fusion Video Vision Transformer Framework for rPPG Estimation
    Park, Soyeon
    Kim, Bo-Kyeong
    Dong, Suh-Yeon
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [33] DatUS: Data-Driven Unsupervised Semantic Segmentation With Pretrained Self-Supervised Vision Transformer
    Kumar, Sonal
    Sur, Arijit
    Baruah, Rashmi Dutta
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2024, 16 (05) : 1775 - 1788
  • [34] Multimodal Image Fusion via Self-Supervised Transformer
    Zhang, Jing
    Liu, Yu
    Liu, Aiping
    Xie, Qingguo
    Ward, Rabab
    Wang, Z. Jane
    Chen, Xun
    IEEE SENSORS JOURNAL, 2023, 23 (09) : 9796 - 9807
  • [35] Self-supervised modal optimization transformer for image captioning
    Wang, Ye
    Li, Daitianxia
    Liu, Qun
    Liu, Li
    Wang, Guoyin
    NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (31) : 19863 - 19878
  • [36] Self-supervised Hypergraph Transformer with Alignment and Uniformity for Recommendation
    Yang, XianFeng
    Liu, Yang
    IAENG INTERNATIONAL JOURNAL OF COMPUTER SCIENCE, 2024, 51 (03) : 292 - 300
  • [37] Self-Supervised Pretraining Transformer for Seismic Data Denoising
    Wang, Hongzhou
    Lin, Jun
    Li, Yue
    Dong, Xintong
    Tong, Xunqian
    Lu, Shaoping
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 25
  • [38] MST: Masked Self-Supervised Transformer for Visual Representation
    Li, Zhaowen
    Chen, Zhiyang
    Yang, Fan
    Li, Wei
    Zhu, Yousong
    Zhao, Chaoyang
    Deng, Rui
    Wu, Liwei
    Zhao, Rui
    Tang, Ming
    Wang, Jinqiao
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [39] Self-Supervised Image Aesthetic Assessment Based on Transformer
    Jia, Minrui
    Wang, Guangao
    Wang, Zibei
    Yang, Shuai
    Ke, Yongzhen
    Wang, Kai
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2025, 24 (01)
  • [40] Self-supervised graph transformer networks for social recommendation
    Li, Qinyao
    Yang, Qimeng
    Tian, Shengwei
    Yu, Long
    COMPUTERS & ELECTRICAL ENGINEERING, 2025, 123