Labelling unlabelled videos from scratch with multi-modal self-supervision

Cited: 0
Authors
Asano, Yuki M. [1 ]
Patrick, Mandela [1 ,2 ]
Rupprecht, Christian [1 ]
Vedaldi, Andrea [1 ,2 ]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Facebook AI Res, Menlo Pk, CA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
A large part of the current success of deep learning lies in the effectiveness of data, or more precisely, labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While recent methods in the image domain can generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing in the video domain, where the current focus is on learning feature representations. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Pages: 12
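The clustering approach summarised in the abstract can be illustrated with a minimal sketch: per-modality cluster affinities from a visual and an audio encoder are fused, and a Sinkhorn-Knopp normalisation balances cluster usage before the argmax over clusters is taken as the pseudo-label. This follows the general SeLa-style balanced-clustering recipe that this line of work builds on; the function names, the temperature eps, and the multiplicative fusion of modalities below are illustrative assumptions, not the authors' released implementation.

import numpy as np

def sinkhorn_knopp(scores, n_iters=50):
    # Balanced soft assignment of N samples to K clusters: alternately
    # normalise rows (each sample spreads unit mass over clusters) and
    # columns (each cluster receives equal total mass), so no cluster
    # collapses to zero usage.
    q = np.array(scores, dtype=np.float64)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)
        q /= n
        q /= q.sum(axis=0, keepdims=True)
        q /= k
    return q * n  # rescale so each row sums to 1 again

def modality_affinity(logits, eps=1.0):
    # Softmax-style affinity per modality; eps is an assumed temperature
    # knob (lower values sharpen the assignments).
    z = np.asarray(logits, dtype=np.float64) / eps
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    return np.exp(z)

def multimodal_pseudo_labels(visual_logits, audio_logits, eps=1.0):
    # Audio-visual correspondence: both views of a clip should favour
    # the same cluster, so fuse the affinities multiplicatively before
    # the balanced assignment (the fusion rule is an assumption here).
    fused = (modality_affinity(visual_logits, eps)
             * modality_affinity(audio_logits, eps))
    return sinkhorn_knopp(fused).argmax(axis=1)

# Toy usage: 8 clips, 4 clusters, random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
labels = multimodal_pseudo_labels(rng.normal(size=(8, 4)),
                                  rng.normal(size=(8, 4)))
print(labels)  # 8 pseudo-labels, each in {0, 1, 2, 3}

The balancing step is what makes such pseudo-labels usable: without it, a degenerate solution that assigns every video to a single cluster would trivially minimise the assignment cost.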
Related papers
50 items in total (items [31]-[40] shown)
  • [31] ToolBot: Learning Oriented Keypoints for Tool Usage From Self-Supervision
    Wei, Junhang
    Hao, Peng
    Wang, Shuo
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (01) : 723 - 731
  • [32] Development of Multi-Modal Control Programs for Continuous-Discrete Process Supervision
    De Paula, Mariano
    Martinez, Ernesto
    10TH INTERNATIONAL SYMPOSIUM ON PROCESS SYSTEMS ENGINEERING, 2009, 27 : 1383 - 1388
  • [33] The Value of Mixing It Up: Student Experiences of a Multi-Modal Approach to Supervision on Placement
    Vassos, Sevi
    Harms, Louise
    Rose, David
    BRITISH JOURNAL OF SOCIAL WORK, 2019, 49 (05): 1274 - 1295
  • [34] SELF-AUGMENTED MULTI-MODAL FEATURE EMBEDDING
    Matsuo, Shinnosuke
    Uchida, Seiichi
    Iwana, Brian Kenji
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3995 - 3999
  • [35] Reading Between the Frames: Multi-modal Depression Detection in Videos from Non-verbal Cues
    Gimeno-Gomez, David
    Bucur, Ana-Maria
    Cosma, Adrian
    Martinez-Hinarejos, Carlos-David
    Rosso, Paolo
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 191 - 209
  • [36] Multi-modal trajectory forecasting with Multi-scale Interactions and Multi-pseudo-target Supervision
    Zhao, Cong
    Song, Andi
    Zeng, Zimu
    Ji, Yuxiong
    Du, Yuchuan
    KNOWLEDGE-BASED SYSTEMS, 2024, 296
  • [37] Multi-Task Multi-modal Semantic Hashing for Web Image Retrieval with Limited Supervision
    Xie, Liang
    Zhu, Lei
    Cheng, Zhiyong
    MULTIMEDIA MODELING (MMM 2017), PT I, 2017, 10132 : 465 - 477
  • [38] Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba
    Huang, Lianghua
    Liu, Yu
    Zhou, Xiangzeng
    You, Ansheng
    Li, Ming
    Wang, Bin
    Zhang, Yingya
    Pan, Pan
    Xu, Yinghui
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1148 - 1156
  • [39] Learning Image Inpainting from Incomplete Images using Self-Supervision
    Yenamandra, Sriram
    Khurana, Ansh
    Jena, Rohit
    Awate, Suyash P.
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 10390 - 10397
  • [40] SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation
    Halidanmu Abudukelimu
    Jishang Chen
    Yunze Liang
    Abudukelimu Abulizi
    Alimujiang Yasen
    APPLIED INTELLIGENCE, 2024, 54: 4140 - 4152