Labelling unlabelled videos from scratch with multi-modal self-supervision

Cited: 0
Authors
Asano, Yuki M. [1]
Patrick, Mandela [1,2]
Rupprecht, Christian [1]
Vedaldi, Andrea [1,2]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Facebook AI Res, Menlo Pk, CA USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
A large part of the current success of deep learning lies in the effectiveness of data, more precisely, labelled data. Yet labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where the current focus is on learning feature representations. In this work, we (a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders, and (b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have a high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
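The abstract describes deriving pseudo-labels by clustering over the joint audio-visual signal and then measuring how well the clusters line up with ground-truth classes. The snippet below is a minimal sketch of that idea, not the authors' actual method: it assumes per-clip visual and audio embeddings are already available as NumPy arrays, uses plain k-means over fused features as a stand-in for the paper's multi-modal clustering objective, and scores the resulting pseudo-labels against ground truth with NMI and Hungarian-matched accuracy.

```python
"""
Minimal sketch (not the paper's exact algorithm): pseudo-label clips by
clustering fused audio-visual embeddings, then measure overlap with
ground-truth labels. Feature arrays and k-means are illustrative assumptions.
"""
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment


def pseudo_label(visual_feats: np.ndarray, audio_feats: np.ndarray, k: int) -> np.ndarray:
    """Cluster L2-normalised, concatenated audio-visual features into k pseudo-labels."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    joint = np.concatenate([v, a], axis=1)  # simple fusion of the two modalities
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(joint)


def matched_accuracy(pseudo: np.ndarray, truth: np.ndarray) -> float:
    """Hungarian matching between clusters and classes, then best-match accuracy."""
    k = int(max(pseudo.max(), truth.max())) + 1
    overlap = np.zeros((k, k), dtype=np.int64)
    for p, t in zip(pseudo, truth):
        overlap[p, t] += 1
    rows, cols = linear_sum_assignment(overlap.max() - overlap)  # maximise total overlap
    return overlap[rows, cols].sum() / len(pseudo)


if __name__ == "__main__":
    # Toy stand-ins for per-clip embeddings from a video encoder and an audio encoder.
    rng = np.random.default_rng(0)
    n, k = 1000, 10
    truth = rng.integers(0, k, size=n)
    visual = rng.normal(size=(n, 128)) + truth[:, None]  # weakly class-correlated features
    audio = rng.normal(size=(n, 64)) + truth[:, None]
    pseudo = pseudo_label(visual, audio, k)
    print("NMI:", normalized_mutual_info_score(truth, pseudo))
    print("matched accuracy:", matched_accuracy(pseudo, truth))
```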
Pages: 12
Related Papers (50 in total)
  • [21] Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval
    Xu, Xing
    Lu, Huimin
    Song, Jingkuan
    Yang, Yang
    Shen, Heng Tao
    Li, Xuelong
    IEEE TRANSACTIONS ON CYBERNETICS, 2020, 50 (06) : 2400 - 2413
  • [22] On the Effects of Self-supervision and Contrastive Alignment in Deep Multi-view Clustering
    Trosten, Daniel J.
    Lokse, Sigurd
    Jenssen, Robert
    Kampffmeyer, Michael C.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23976 - 23985
  • [23] Deep Spatial Prediction via Heterogeneous Multi-source Self-supervision
    Zhang, Minxing
    Yu, Dazhou
    Li, Yun
    Zhao, Liang
    ACM TRANSACTIONS ON SPATIAL ALGORITHMS AND SYSTEMS, 2023, 9 (03)
  • [24] Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking
    Feng, Wei
    Wang, Feifan
    Han, Ruize
    Gan, Yiyang
    Qian, Zekun
    Hou, Junhui
    Wang, Song
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (01) : 351 - 368
  • [25] Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
    Acar, Esra
    Hopfgartner, Frank
    Albayrak, Sahin
    2015 13TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2015,
  • [26] SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation
    Abudukelimu, Halidanmu
    Chen, Jishang
    Liang, Yunze
    Abulizi, Abudukelimu
    Yasen, Alimujiang
    APPLIED INTELLIGENCE, 2024, 54 (05) : 4140 - 4152
  • [27] Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos
    Zhang, Zongmeng
    Han, Xianjing
    Song, Xuemeng
    Yan, Yan
    Nie, Liqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 8265 - 8277
  • [28] Co-learning: Learning from Noisy Labels with Self-supervision
    Tan, Cheng
    Xia, Jun
    Wu, Lirong
    Li, Stan Z.
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1405 - 1413
  • [29] TRACK: A MULTI-MODAL DEEP ARCHITECTURE FOR HEAD MOTION PREDICTION IN 360° VIDEOS
    Rondon, Miguel Fabian Romero
    Sassatelli, Lucile
    Pardo, Ramon Aparicio
    Precioso, Frederic
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2586 - 2590
  • [30] MULTI-MODAL TOPIC UNIT SEGMENTATION IN VIDEOS USING CONDITIONAL RANDOM FIELDS
    Xu, Su
    Feng, Bailan
    Xu, Bo
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 2287 - 2291