Repeat and learn: Self-supervised visual representations learning by Scene Localization

Cited by: 1
Authors
Altabrawee, Hussein [1 ,2 ]
Noor, Mohd Halim Mohd [1 ]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Main Campus, Gelugor 11800, Penang, Malaysia
[2] Al Muthanna Univ, Comp Ctr, Main Campus, Samawah 66001, Al Muthanna, Iraq
Keywords
Visual representations learning; Action recognition; Self-supervised learning
DOI
10.1016/j.patcog.2024.110804
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Large labeled datasets are crucial for progress in video understanding. However, the labeling process is time-consuming, expensive, and tiresome. To overcome this impediment, various pretext tasks exploit the temporal coherence in videos to learn visual representations in a self-supervised manner. However, these pretext tasks (order verification and sequence sorting) struggle when encountering cyclic actions due to the label ambiguity problem. To overcome these limitations, we present a novel temporal pretext task for self-supervised learning of visual representations from unlabeled videos. Repeated Scene Localization (RSL) is a multi-class classification pretext task that changes the temporal order of the frames in a video by repeating a scene. The network is then trained to identify the modified video, localize the repeated scene, and identify the unmodified original videos that do not contain repeated scenes. We evaluated the proposed pretext task on two benchmark datasets, UCF-101 and HMDB-51. The experimental results show that the proposed pretext task achieves state-of-the-art results in action recognition and video retrieval. In action recognition, our S3D model achieves 88.15% and 56.86% on UCF-101 and HMDB-51, respectively, outperforming the current state of the art by 1.05% and 3.26%. Our R(2+1)D-Adjacent model achieves 83.52% and 54.50% on UCF-101 and HMDB-51, respectively, outperforming the single pretext tasks by 8.7% and 13.9%. In video retrieval, our R(2+1)D-Offset model outperforms the single pretext tasks by 4.68% and 1.1% in Top-1 accuracy on UCF-101 and HMDB-51, respectively. The source code and the trained models are publicly available at https://github.com/Hussein-A-Hassan/RSL-Pretext.
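As the abstract describes, the RSL pretext turns each unlabeled clip into a multi-class classification sample: either the clip is left in its original temporal order, or one scene is repeated and the class index encodes where the repetition occurs. The following is a minimal Python sketch of such sample construction, assuming equal-length segments, in-place duplication of one segment, and a label encoding of 0 for "unmodified" and k for "segment k repeated"; the function name make_rsl_sample, the parameter num_segments, and these details are illustrative assumptions, not the authors' exact procedure.

    import random
    import numpy as np

    def make_rsl_sample(frames, num_segments=4, repeat=True):
        """Build one RSL-style training sample (illustrative sketch).

        frames: np.ndarray of shape (T, H, W, C) with consecutive video frames.
        Returns (clip, label): label 0 means "unmodified"; label k in 1..num_segments
        means "segment k was repeated" (assumed encoding, not the paper's exact one).
        """
        seg_len = frames.shape[0] // num_segments
        segments = [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

        if not repeat:
            # Original temporal order -> the "no repeated scene" class.
            return np.concatenate(segments, axis=0), 0

        # Pick one segment (scene) and insert a copy right after itself,
        # which changes the temporal order of the clip.
        k = random.randrange(num_segments)
        modified = segments[:k + 1] + [segments[k]] + segments[k + 1:]
        return np.concatenate(modified, axis=0), k + 1

    # Example: a dummy 64-frame clip split into 4 scenes of 16 frames each.
    dummy = np.zeros((64, 112, 112, 3), dtype=np.uint8)
    clip, label = make_rsl_sample(dummy, num_segments=4, repeat=True)
    print(clip.shape, label)  # (80, 112, 112, 3) and a label in 1..4

A classifier trained on such samples would predict both whether a clip was modified and where the repeated scene lies, which is the multi-class formulation the abstract outlines.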
Pages: 10