AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Cited by: 11
Authors
Bandara, Wele Gedara Chaminda [1 ]
Patel, Naman [2 ]
Gholami, Ali [2 ]
Nikkhah, Mehdi [2 ]
Agrawal, Motilal [2 ]
Patel, Vishal M. [1 ]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Zippin, San Francisco, CA USA
DOI
10.1109/CVPR52729.2023.01394
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, and other modalities by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch-, tube-, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from regions with high spatiotemporal information, allowing us to mask 95% of tokens, which lowers memory requirements and speeds up pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach, and report state-of-the-art top-1 accuracies of 70.0% on SSv2 and 81.7% on Kinetics-400 action classification with a ViT-Base backbone and 800 pre-training epochs. Code and pre-trained models are available at: https://github.com/wgcban/adamae.git.
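The adaptive sampling idea in the abstract — draw visible tokens from a learned categorical distribution, then train the sampler with a REINFORCE-style reward equal to the reconstruction error — can be sketched roughly as follows. This is an illustrative NumPy sketch, not the authors' implementation: the helper names, the token count (1568, e.g. an 8×14×14 token grid), and sampling without replacement are assumptions for the example.

```python
import numpy as np

def sample_visible_tokens(logits, mask_ratio=0.95, rng=None):
    """Sample visible token indices from a categorical distribution
    over all space-time patch tokens (hypothetical helper)."""
    rng = rng or np.random.default_rng(0)
    n = logits.shape[0]
    n_visible = int(round(n * (1.0 - mask_ratio)))
    # softmax over token logits -> sampling probabilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # draw the small visible set without replacement, weighted by probs
    visible = rng.choice(n, size=n_visible, replace=False, p=probs)
    return visible, probs

def sampling_loss(probs, visible, recon_error):
    """Policy-gradient-style objective: tokens whose selection coincides
    with high expected reconstruction error are rewarded, i.e. we
    maximize E[log p(token) * error] by minimizing its negative."""
    log_p = np.log(probs[visible] + 1e-9)
    return -np.mean(log_p * recon_error[visible])

logits = np.zeros(1568)  # untrained sampler: uniform over all tokens
visible, probs = sample_visible_tokens(logits, mask_ratio=0.95)
# stand-in per-token reconstruction errors from the MAE decoder
err = np.abs(np.random.default_rng(1).normal(size=1568))
loss = sampling_loss(probs, visible, err)
```

With a 95% mask ratio only 5% of the tokens are kept visible, which is what makes pre-training cheap; in AdaMAE the gradient of this sampling loss pushes probability mass toward high-information regions, while the MAE itself is trained with the usual reconstruction loss.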
Pages: 14507-14517 (11 pages)