AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Cited by: 11
Authors
Bandara, Wele Gedara Chaminda [1]
Patel, Naman [2]
Gholami, Ali [2]
Nikkhah, Mehdi [2]
Agrawal, Motilal [2]
Patel, Vishal M. [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Zippin, San Francisco, CA USA
DOI
10.1109/CVPR52729.2023.01394
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs. Code and pre-trained models are available at: https://github.com/wgcban/adamae.git.
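The sampling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The names `sample_visible_tokens` and `sampling_loss`, the token count of 1568, the random logits, and the random per-token errors are all assumptions made for illustration. The loss is a generic REINFORCE-style surrogate chosen to match the abstract's description of rewarding tokens with high reconstruction error; the exact objective used in the paper may differ.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of token logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def sample_visible_tokens(logits, mask_ratio, rng):
    """Sample visible token indices from a categorical distribution.

    logits: hypothetical output of an auxiliary sampling network, one per token.
    mask_ratio: fraction of tokens to hide (0.95 in the abstract).
    """
    probs = softmax(logits)
    num_tokens = probs.shape[0]
    num_visible = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    # Draw distinct token indices, weighted by the estimated probabilities.
    visible_idx = rng.choice(num_tokens, size=num_visible, replace=False, p=probs)
    masked = np.ones(num_tokens, dtype=bool)
    masked[visible_idx] = False
    return probs, np.sort(visible_idx), masked


def sampling_loss(probs, masked, recon_error):
    """REINFORCE-style surrogate (an assumption, not the paper's exact loss).

    Minimizing it raises the probability of tokens whose reconstruction
    error is high, treating recon_error as a fixed reward.
    """
    return -np.sum(np.log(probs[masked] + 1e-12) * recon_error[masked])


rng = np.random.default_rng(0)
num_tokens = 1568                          # assumed token count for one clip
logits = rng.normal(size=num_tokens)       # stand-in for sampling-network output
probs, visible_idx, masked = sample_visible_tokens(logits, mask_ratio=0.95, rng=rng)

recon_error = rng.random(num_tokens)       # stand-in for per-token decoder error
loss = sampling_loss(probs, masked, recon_error)
print(len(visible_idx), int(masked.sum()), float(loss))
```

In a real training loop the logits would come from a trainable network and the reconstruction error from the MAE decoder, with gradients flowing only through the log-probabilities in the sampling loss.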
Pages: 14507 - 14517
Page count: 11
Related papers
50 in total
  • [21] Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation
    Niizumi, Daisuke
    Takeuchi, Daiki
    Ohishi, Yasunori
    Harada, Noboru
    Kashino, Kunio
    HEAR: HOLISTIC EVALUATION OF AUDIO REPRESENTATIONS, VOL 166, 2021, 166 : 1 - 24
  • [22] Disjoint Masking With Joint Distillation for Efficient Masked Image Modeling
    Ma, Xin
    Liu, Chang
    Xie, Chunyu
    Ye, Long
    Deng, Yafeng
    Ji, Xiangyang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 3077 - 3087
  • [23] Masked Autoencoders in 3D Point Cloud Representation Learning
    Jiang, Jincen
    Lu, Xuequan
    Zhao, Lizhi
    Dazeley, Richard
    Wang, Meili
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 820 - 831
  • [24] Ensembled masked graph autoencoders for link anomaly detection in a road network considering spatiotemporal features
    Yu, Wenhao
    Huang, Mengqiu
    Wu, Shangyou
    Zhang, Yifan
    INFORMATION SCIENCES, 2023, 622 : 456 - 475
  • [25] Efficient Transformer Inference for Extremely Weak Edge Devices using Masked Autoencoders
    Liu, Tao
    Li, Peng
    Gu, Yu
    Liu, Peng
    ICC 2023-IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS, 2023 : 1718 - 1723
  • [26] Motion-Guided Masking for Spatiotemporal Representation Learning
    Fan, David
    Wang, Jue
    Liao, Shuai
    Zhu, Yi
    Bhat, Vimal
    Santos-Villalobos, Hector
    Rohith, M. V.
    Li, Xinyu
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023 : 5596 - 5606
  • [27] Multilevel Contrastive Graph Masked Autoencoders for Unsupervised Graph-Structure Learning
    Fu, Sichao
    Peng, Qinmu
    He, Yang
    Wang, Xiaorui
    Zou, Bin
    Xu, Duanquan
    Jing, Xiao-Yuan
    You, Xinge
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (02) : 3464 - 3478
  • [28] Learn from Incomplete Tactile Data: Tactile Representation Learning with Masked Autoencoders
    Cao, Guanqun
    Jiang, Jiaqi
    Bollegala, Danushka
    Luo, Shan
    2023 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2023 : 10800 - 10805
  • [29] T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning
    Wei, Weijie
    Nejadasl, Fatemeh Karimi
    Gevers, Theo
    Oswald, Martin R.
    COMPUTER VISION - ECCV 2024, PT XI, 2025, 15069 : 178 - 195
  • [30] Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
    Liu, Jiaming
    Xu, Ran
    Yang, Senqiao
    Zhang, Renrui
    Zhang, Qizhe
    Chen, Zehui
    Guo, Yandong
    Zhang, Shanghang
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024 : 28653 - 28663