Labelling unlabelled videos from scratch with multi-modal self-supervision

Cited: 0
Authors
Asano, Yuki M. [1]
Patrick, Mandela [1,2]
Rupprecht, Christian [1]
Vedaldi, Andrea [1,2]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Facebook AI Res, Menlo Pk, CA USA
Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020 | 2020 / Vol. 33
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
A large part of the current success of deep learning lies in the effectiveness of data, more precisely, labelled data. Yet labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where the current focus is on learning feature representations. In this work, we (a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders, and (b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have a high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
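The abstract describes deriving pseudo-labels by clustering over the joint audio-visual signal and then measuring how well the clusters line up with ground-truth classes. The snippet below is a minimal sketch of that idea, not the authors' actual method: it assumes per-clip visual and audio embeddings are already available as NumPy arrays, uses plain k-means over fused features as a stand-in for the paper's multi-modal clustering objective, and scores the resulting pseudo-labels against ground truth with NMI and Hungarian-matched accuracy.

```python
"""
Minimal sketch (not the paper's exact algorithm): pseudo-label clips by
clustering fused audio-visual embeddings, then measure overlap with
ground-truth labels. Feature arrays and k-means are illustrative assumptions.
"""
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment


def pseudo_label(visual_feats: np.ndarray, audio_feats: np.ndarray, k: int) -> np.ndarray:
    """Cluster L2-normalised, concatenated audio-visual features into k pseudo-labels."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    joint = np.concatenate([v, a], axis=1)  # simple fusion of the two modalities
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(joint)


def matched_accuracy(pseudo: np.ndarray, truth: np.ndarray) -> float:
    """Hungarian matching between clusters and classes, then best-match accuracy."""
    k = int(max(pseudo.max(), truth.max())) + 1
    overlap = np.zeros((k, k), dtype=np.int64)
    for p, t in zip(pseudo, truth):
        overlap[p, t] += 1
    rows, cols = linear_sum_assignment(overlap.max() - overlap)  # maximise total overlap
    return overlap[rows, cols].sum() / len(pseudo)


if __name__ == "__main__":
    # Toy stand-ins for per-clip embeddings from a video encoder and an audio encoder.
    rng = np.random.default_rng(0)
    n, k = 1000, 10
    truth = rng.integers(0, k, size=n)
    visual = rng.normal(size=(n, 128)) + truth[:, None]  # weakly class-correlated features
    audio = rng.normal(size=(n, 64)) + truth[:, None]
    pseudo = pseudo_label(visual, audio, k)
    print("NMI:", normalized_mutual_info_score(truth, pseudo))
    print("matched accuracy:", matched_accuracy(pseudo, truth))
```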
Pages: 12
Related Papers (50 in total)
  • [21] Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval
    Xu, Xing
    Lu, Huimin
    Song, Jingkuan
    Yang, Yang
    Shen, Heng Tao
    Li, Xuelong
    IEEE TRANSACTIONS ON CYBERNETICS, 2020, 50 (06) : 2400 - 2413
  • [22] On the Effects of Self-supervision and Contrastive Alignment in Deep Multi-view Clustering
    Trosten, Daniel J.
    Lokse, Sigurd
    Jenssen, Robert
    Kampffmeyer, Michael C.
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23976 - 23985
  • [23] Deep Spatial Prediction via Heterogeneous Multi-source Self-supervision
    Zhang, Minxing
    Yu, Dazhou
    Li, Yun
    Zhao, Liang
    ACM TRANSACTIONS ON SPATIAL ALGORITHMS AND SYSTEMS, 2023, 9 (03)
  • [24] Unveiling the Power of Self-Supervision for Multi-View Multi-Human Association and Tracking
    Feng, Wei
    Wang, Feifan
    Han, Ruize
    Gan, Yiyang
    Qian, Zekun
    Hou, Junhui
    Wang, Song
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2025, 47 (01) : 351 - 368
  • [25] Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
    Acar, Esra
    Hopfgartner, Frank
    Albayrak, Sahin
    2015 13TH INTERNATIONAL WORKSHOP ON CONTENT-BASED MULTIMEDIA INDEXING (CBMI), 2015,
  • [26] SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation
    Abudukelimu, Halidanmu
    Chen, Jishang
    Liang, Yunze
    Abulizi, Abudukelimu
    Yasen, Alimujiang
    APPLIED INTELLIGENCE, 2024, 54 (05) : 4140 - 4152
  • [27] Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos
    Zhang, Zongmeng
    Han, Xianjing
    Song, Xuemeng
    Yan, Yan
    Nie, Liqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 8265 - 8277
  • [28] Co-learning: Learning from Noisy Labels with Self-supervision
    Tan, Cheng
    Xia, Jun
    Wu, Lirong
    Li, Stan Z.
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1405 - 1413
  • [29] TRACK: A MULTI-MODAL DEEP ARCHITECTURE FOR HEAD MOTION PREDICTION IN 360° VIDEOS
    Rondon, Miguel Fabian Romero
    Sassatelli, Lucile
    Pardo, Ramon Aparicio
    Precioso, Frederic
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2586 - 2590
  • [30] MULTI-MODAL TOPIC UNIT SEGMENTATION IN VIDEOS USING CONDITIONAL RANDOM FIELDS
    Xu, Su
    Feng, Bailan
    Xu, Bo
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 2287 - 2291