SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders

Cited: 0
Authors
Li, Gang [1 ,2 ]
Zheng, Heliang [3 ]
Liu, Daqing [3 ]
Wang, Chaoyue [3 ]
Su, Bing [4 ]
Zheng, Changwen [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] JD Explore Acad, Beijing, Peoples R China
[4] Renmin Univ China, Beijing, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Recently, significant progress has been made in masked image modeling as it catches up to masked language modeling. However, unlike words in NLP, images lack a natural semantic decomposition, which still sets masked autoencoding (MAE) in vision apart from its language counterpart. In this paper, we explore a potential visual analogue of words, i.e., semantic parts, and integrate semantic information into the training process of MAE by proposing a semantic-guided masking strategy. Compared to the widely adopted random masking, our strategy can gradually guide the network to learn varied information, i.e., from intra-part patterns to inter-part relations. In particular, we achieve this in two steps. 1) Semantic part learning: we design a self-supervised part-learning method that obtains semantic parts by leveraging and refining the multi-head attention of a ViT-based encoder. 2) Semantic-guided MAE (SemMAE) training: we design a masking strategy that varies from masking a portion of patches in each part to masking a portion of (whole) parts in an image. Extensive experiments on various vision tasks show that SemMAE learns better image representations by integrating semantic information. In particular, SemMAE achieves 84.5% fine-tuning accuracy on ImageNet-1k, outperforming vanilla MAE by 1.4%. On semantic segmentation and fine-grained recognition tasks, SemMAE also brings significant improvements and yields state-of-the-art performance. Our code is available at https://github.com/ucasligang/SemMAE.
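The exact masking schedule lives in the authors' repository; as a rough illustration only, the following PyTorch sketch interpolates between the two regimes the abstract describes. The function name semantic_guided_mask, the linear progress schedule, and the budget-splitting heuristic are all assumptions for this sketch, not the paper's released implementation.

```python
import torch

def semantic_guided_mask(part_ids: torch.Tensor,
                         mask_ratio: float = 0.75,
                         progress: float = 0.0) -> torch.Tensor:
    """Boolean patch mask (True = masked) for a single image.

    part_ids : (N,) long tensor assigning each patch to a semantic part,
               e.g. produced by the self-supervised part-learning stage.
    progress : training progress in [0, 1]; 0 masks a portion of patches
               inside every part (intra-part), 1 spends most of the budget
               on hiding whole parts (inter-part).
    NOTE: a hypothetical sketch; the real SemMAE schedule may differ.
    """
    n_patches = part_ids.numel()
    n_mask = int(round(mask_ratio * n_patches))
    parts = part_ids.unique()
    mask = torch.zeros(n_patches, dtype=torch.bool)

    # 1) Inter-part stage: hide a progress-dependent number of whole parts.
    n_whole = int(progress * mask_ratio * len(parts))
    order = parts[torch.randperm(len(parts))]
    for p in order[:n_whole]:
        mask |= part_ids == p

    # 2) Intra-part stage: spread the remaining budget across the parts
    #    that are still visible, so each of them loses some patches.
    #    (Rounding and uneven part sizes make the final ratio approximate.)
    budget = n_mask - int(mask.sum())
    visible = (~mask).nonzero(as_tuple=True)[0]
    if budget > 0 and len(visible) > 0:
        frac = budget / len(visible)
        picks = []
        for p in order[n_whole:]:
            idx = visible[part_ids[visible] == p]
            k = min(len(idx), round(frac * len(idx)))
            picks.append(idx[torch.randperm(len(idx))[:k]])
        picks = torch.cat(picks)
        picks = picks[torch.randperm(len(picks))]  # avoid per-part bias
        mask[picks[:budget]] = True                # trim rounding overshoot
    return mask

# e.g. a 14x14 ViT patch grid grouped into 6 (hypothetical) semantic parts
part_ids = torch.randint(0, 6, (196,))
mask = semantic_guided_mask(part_ids, mask_ratio=0.75, progress=0.5)
```

At progress = 0 this reduces to per-part random masking (every part keeps some visible patches), while toward progress = 1 most of the budget goes to dropping entire parts, forcing the encoder to reason about inter-part relations, which mirrors the intra-part-to-inter-part curriculum the abstract describes.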
Pages: 13
Related Papers
50 records in total
  • [31] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
    Wang, Limin
    Huang, Bingkun
    Zhao, Zhiyu
    Tong, Zhan
    He, Yinan
    Wang, Yi
    Wang, Yali
    Qiao, Yu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14549 - 14560
  • [32] Semantic-Guided Zero-Shot Learning for Low-Light Image/Video Enhancement
    Zheng, Shen
    Gupta, Gaurav
    2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022, : 581 - 590
  • [33] A semantic-guided and self-configurable framework for video analysis
    SanMiguel, Juan C.
    Martinez, Jose M.
    MACHINE VISION AND APPLICATIONS, 2013, 24 (03) : 493 - 512
  • [35] Semantic-guided fuzzing for virtual testing of autonomous driving systems
    Guo, An
    Feng, Yang
    Cheng, Yizhen
    Chen, Zhenyu
    JOURNAL OF SYSTEMS AND SOFTWARE, 2024, 212
  • [36] SFD2: Semantic-guided Feature Detection and Description
    Xue, Fei
    Budvytis, Ignas
    Cipolla, Roberto
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 5206 - 5216
  • [37] Semantic-guided complementary fusion network for salient object detection
    Yang, Kunqian
    He, Caitou
    NEUROCOMPUTING, 2025, 622
  • [38] MGMAE: Motion Guided Masking for Video Masked Autoencoding
    Huang, Bingkun
    Zhao, Zhiyu
    Zhang, Guozhen
    Qiao, Yu
    Wang, Limin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13447 - 13458
  • [39] SGIQA: Semantic-Guided No-Reference Image Quality Assessment
    Pan, Linpeng
    Zhang, Xiaozhe
    Xie, Fengying
    Zhang, Haopeng
    Zheng, Yushan
    IEEE TRANSACTIONS ON BROADCASTING, 2024, 70 (04) : 1292 - 1301
  • [40] Semantic-guided graph neural network for heterogeneous graph embedding
    Han, Mingjing
    Zhang, Han
    Li, Wei
    Yin, Yanbin
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 232