Adaptive Masked Autoencoder Transformer for image classification

Cited by: 1
Authors
Chen, Xiangru [1 ,2 ]
Liu, Chenjing [1 ,2 ]
Hu, Peng
Lin, Jie [1 ,2 ]
Gong, Yunhong
Chen, Yingke [4 ]
Peng, Dezhong [1 ,3 ]
Geng, Xue [2 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[2] Agcy Sci Technol & Res, Inst Infocomm Res, Singapore 138632, Singapore
[3] Sichuan Newstrong UHD Video Technol Co Ltd, Chengdu 610095, Peoples R China
[4] Northumbria Univ, Dept Comp & Informat Sci, Newcastle Upon Tyne NE1 8ST, England
Funding
National Natural Science Foundation of China;
Keywords
Vision transformer; Masked image modeling; Image classification;
DOI
10.1016/j.asoc.2024.111958
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision Transformers (ViTs) have exhibited exceptional performance across a broad spectrum of visual tasks. Nonetheless, their computational requirements often surpass those of prevailing CNN-based models. Token sparsity techniques have been employed to alleviate this issue, but they often discard semantic information and consequently degrade performance. To address these challenges, we propose the Adaptive Masked Autoencoder Transformer (AMAT), a masked image modeling-based method. AMAT integrates a novel adaptive masking mechanism and a training objective function for both the pre-training and fine-tuning stages. Our primary objective is to reduce the complexity of Vision Transformer models while concurrently enhancing their final accuracy. In experiments on the ILSVRC-2012 dataset, our method surpasses the original ViT while achieving up to 40% FLOPs savings, and it outperforms the efficient DynamicViT model by 0.1% while saving 4% of FLOPs. Furthermore, on the Places365 dataset, AMAT saves 21% of FLOPs compared to MAE with only a 0.3% accuracy loss. These findings demonstrate the efficacy of AMAT in mitigating computational complexity while maintaining a high level of accuracy.
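The abstract describes an adaptive masking mechanism that prunes patch tokens so that later encoder blocks process fewer tokens, which is where the FLOPs savings come from. As a rough illustration only, not the authors' AMAT implementation, a minimal PyTorch sketch of score-based token pruning might look like the following; the scoring head, the keep ratio, and all module and variable names are assumptions introduced for this example.

# Illustrative sketch of adaptive token masking for a ViT encoder.
# NOTE: this is NOT the paper's AMAT implementation; the scoring head,
# keep ratio, and all names here are assumptions for illustration only.
import torch
import torch.nn as nn

class AdaptiveTokenMasking(nn.Module):
    """Keeps only the highest-scoring patch tokens for later encoder blocks."""
    def __init__(self, dim: int, keep_ratio: float = 0.6):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight head predicting one importance score per token.
        self.score_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        b, n, d = tokens.shape
        k = max(1, int(n * self.keep_ratio))
        scores = self.score_head(tokens).squeeze(-1)        # (b, n)
        keep_idx = scores.topk(k, dim=1).indices            # (b, k)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d) # (b, k, d)
        # Subsequent attention blocks then run on fewer tokens.
        return torch.gather(tokens, dim=1, index=keep_idx)

# Usage: prune ~40% of the 196 patch tokens of a ViT-B/16 feature map.
x = torch.randn(2, 196, 768)
pruned = AdaptiveTokenMasking(dim=768, keep_ratio=0.6)(x)  # -> (2, 117, 768)

In a scheme of this kind, the keep ratio (or a per-layer schedule of ratios) would control the accuracy/FLOPs trade-off of the sort reported in the abstract.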
Pages: 10