Repeat and learn: Self-supervised visual representations learning by Scene Localization

Cited by: 1
Authors
Altabrawee, Hussein [1, 2]
Noor, Mohd Halim Mohd [1]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Main Campus, Gelugor 11800, Penang, Malaysia
[2] Al Muthanna Univ, Comp Ctr, Main Campus, Samawah 66001, Al Muthanna, Iraq
Keywords
Visual representations learning; Action recognition; Self-supervised learning
DOI
10.1016/j.patcog.2024.110804
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Large labeled datasets are crucial for video understanding progress. However, the labeling process is time-consuming, expensive, and tiresome. To overcome this impediment, various pretexts use the temporal coherence in videos to learn visual representations in a self-supervised manner. However, these pretexts (order verification and sequence sorting) struggle when encountering cyclic actions due to the label ambiguity problem. To overcome these limitations, we present a novel temporal pretext task to address self-supervised learning of visual representations from unlabeled videos. Repeated Scene Localization (RSL) is a multi-class classification pretext that involves changing the temporal order of the frames in a video by repeating a scene. Then, the network is trained to identify the modified video, localize the repeated scene, and identify the unmodified original videos that do not have repeated scenes. We evaluated the proposed pretext on two benchmark datasets, UCF-101 and HMDB-51. The experimental results show that the proposed pretext achieves state-of-the-art results in action recognition and video retrieval tasks. In action recognition, our S3D model achieves 88.15% and 56.86% on UCF-101 and HMDB-51, respectively. It outperforms the current state-of-the-art by 1.05% and 3.26%. Our R(2+1)D-Adjacent model achieves 83.52% and 54.50% on UCF-101 and HMDB-51, respectively. It outperforms the single pretext tasks by 8.7% and 13.9%. In video retrieval, our R(2+1)D-Offset model outperforms the single pretext tasks by 4.68% and 1.1% in Top-1 accuracy on UCF-101 and HMDB-51, respectively. The source code and the trained models are publicly available at https://github.com/Hussein-A-Hassan/RSL-Pretext.
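The pretext described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general idea of building a multi-class sample by repeating a scene. The function names `repeat_scene` and `make_rsl_sample`, the equal-length segment split, the choice to overwrite the following segment with the repeated one, and the label layout (0 = unmodified, k = segment k-1 repeated) are all assumptions made for illustration.

```python
import random


def repeat_scene(frames, label, num_segments=4):
    """Hypothetical helper: return a copy of `frames` modified according to `label`.

    label == 0      -> clip is returned unmodified.
    label == k > 0  -> segment k-1 is copied over segment k, so the scene in
                       segment k-1 appears twice and the clip length is unchanged.
    """
    frames = list(frames)
    seg_len = len(frames) // num_segments
    if num_segments < 2 or seg_len == 0:
        raise ValueError("need at least 2 segments and 1 frame per segment")
    if not 0 <= label < num_segments:
        raise ValueError("label must be in [0, num_segments)")
    if label == 0:
        return frames
    src = (label - 1) * seg_len   # start of the scene to repeat
    dst = label * seg_len         # start of the segment it overwrites
    clip = frames[:]
    clip[dst:dst + seg_len] = frames[src:src + seg_len]
    return clip


def make_rsl_sample(frames, num_segments=4, rng=None):
    """Hypothetical helper: draw a class uniformly and build (clip, label)."""
    rng = rng or random.Random()
    label = rng.randrange(num_segments)  # assumed uniform class sampling
    return repeat_scene(frames, label, num_segments), label


if __name__ == "__main__":
    # Frame indices stand in for decoded video frames.
    clip, label = make_rsl_sample(range(8), num_segments=4, rng=random.Random(0))
    print(label, clip)
```

In this sketch a network would receive `clip` and be trained with a standard classification loss to predict `label`, which jointly covers detecting a modification and localizing the repeated scene. Keeping the clip length fixed is a design choice of the sketch, so that length alone cannot reveal whether a clip was modified.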
Pages: 10
Related papers
50 in total
  • [31] Learning Self-Supervised Multimodal Representations of Human Behaviour
    Shukla, Abhinav
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4748 - 4751
  • [32] Self-Supervised Visual Representations for Cross-Modal Retrieval
    Patel, Yash
    Gomez, Lluis
    Rusinol, Marcal
    Karatzas, Dimosthenis
    Jawahar, C. V.
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 182 - 186
  • [33] Learning Where to Learn in Cross-View Self-Supervised Learning
    Huang, Lang
    You, Shan
    Zheng, Mingkai
    Wang, Fei
    Qian, Chen
    Yamasaki, Toshihiko
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 14431 - 14440
  • [34] Learning What and Where to Learn: A New Perspective on Self-Supervised Learning
    Zhao, Wenyi
    Yang, Lu
    Zhang, Weidong
    Tian, Yongqin
    Jia, Wenhe
    Li, Wei
    Yang, Mu
    Pan, Xipeng
    Yang, Huihua
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6620 - 6633
  • [35] Scene Interpretation Method using Transformer and Self-supervised Learning
    Kobayashi, Yuya
    Suzuki, Masahiro
    Matsuo, Yutaka
    Transactions of the Japanese Society for Artificial Intelligence, 2022, 37 (02)
  • [36] Shot Contrastive Self-Supervised Learning for Scene Boundary Detection
    Chen, Shixing
    Nie, Xiaohan
    Fan, David
    Zhang, Dongqing
    Bhat, Vimal
    Hamid, Raffay
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 9791 - 9800
  • [37] Multi-Label Self-Supervised Learning with Scene Images
    Zhu, Ke
    Fu, Minghao
    Wu, Jianxin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 6671 - 6680
  • [38] Self-Supervised Relative Depth Learning for Urban Scene Understanding
    Jiang, Huaizu
    Larsson, Gustav
    Maire, Michael
    Shakhnarovich, Greg
    Learned-Miller, Erik
    COMPUTER VISION - ECCV 2018, PT XI, 2018, 11215 : 20 - 37
  • [39] Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
    Owens, Andrew
    Efros, Alexei A.
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 639 - 658
  • [40] Learning Representations for New Sound Classes With Continual Self-Supervised Learning
    Wang, Zhepei
    Subakan, Cem
    Jiang, Xilin
    Wu, Junkai
    Tzinis, Efthymios
    Ravanelli, Mirco
    Smaragdis, Paris
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2607 - 2611