Labelling unlabelled videos from scratch with multi-modal self-supervision

Cited: 0
Authors
Asano, Yuki M. [1 ]
Patrick, Mandela [1 ,2 ]
Rupprecht, Christian [1 ]
Vedaldi, Andrea [1 ,2 ]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Facebook AI Res, Menlo Pk, CA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
A large part of the current success of deep learning lies in the effectiveness of data, or more precisely, labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While recent methods in the image domain can generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing in the video domain, where the current focus is on learning feature representations. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Pages: 12
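The clustering approach summarised in the abstract can be illustrated with a minimal sketch: per-modality cluster affinities from a visual and an audio encoder are fused, and a Sinkhorn-Knopp normalisation balances cluster usage before the argmax over clusters is taken as the pseudo-label. This follows the general SeLa-style balanced-clustering recipe that this line of work builds on; the function names, the temperature eps, and the multiplicative fusion of modalities below are illustrative assumptions, not the authors' released implementation.

import numpy as np

def sinkhorn_knopp(scores, n_iters=50):
    # Balanced soft assignment of N samples to K clusters: alternately
    # normalise rows (each sample spreads unit mass over clusters) and
    # columns (each cluster receives equal total mass), so no cluster
    # collapses to zero usage.
    q = np.array(scores, dtype=np.float64)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)
        q /= n
        q /= q.sum(axis=0, keepdims=True)
        q /= k
    return q * n  # rescale so each row sums to 1 again

def modality_affinity(logits, eps=1.0):
    # Softmax-style affinity per modality; eps is an assumed temperature
    # knob (lower values sharpen the assignments).
    z = np.asarray(logits, dtype=np.float64) / eps
    z -= z.max(axis=1, keepdims=True)  # numerical stability
    return np.exp(z)

def multimodal_pseudo_labels(visual_logits, audio_logits, eps=1.0):
    # Audio-visual correspondence: both views of a clip should favour
    # the same cluster, so fuse the affinities multiplicatively before
    # the balanced assignment (the fusion rule is an assumption here).
    fused = (modality_affinity(visual_logits, eps)
             * modality_affinity(audio_logits, eps))
    return sinkhorn_knopp(fused).argmax(axis=1)

# Toy usage: 8 clips, 4 clusters, random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
labels = multimodal_pseudo_labels(rng.normal(size=(8, 4)),
                                  rng.normal(size=(8, 4)))
print(labels)  # 8 pseudo-labels, each in {0, 1, 2, 3}

The balancing step is what makes such pseudo-labels usable: without it, a degenerate solution that assigns every video to a single cluster would trivially minimise the assignment cost.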
Related papers
50 items in total (items [31]-[40] shown)
  • [31] ToolBot: Learning Oriented Keypoints for Tool Usage From Self-Supervision
    Wei, Junhang
    Hao, Peng
    Wang, Shuo
    IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, 2024, 20 (01) : 723 - 731
  • [32] Development of Multi-Modal Control Programs for Continuous-Discrete Process Supervision
    De Paula, Mariano
    Martinez, Ernesto
    10TH INTERNATIONAL SYMPOSIUM ON PROCESS SYSTEMS ENGINEERING, 2009, 27 : 1383 - 1388
  • [33] The Value of Mixing It Up: Student Experiences of a Multi-Modal Approach to Supervision on Placement
    Vassos, Sevi
    Harms, Louise
    Rose, David
    BRITISH JOURNAL OF SOCIAL WORK, 2019, 49 (05): 1274 - 1295
  • [34] SELF-AUGMENTED MULTI-MODAL FEATURE EMBEDDING
    Matsuo, Shinnosuke
    Uchida, Seiichi
    Iwana, Brian Kenji
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3995 - 3999
  • [35] Reading Between the Frames: Multi-modal Depression Detection in Videos from Non-verbal Cues
    Gimeno-Gomez, David
    Bucur, Ana-Maria
    Cosma, Adrian
    Martinez-Hinarejos, Carlos-David
    Rosso, Paolo
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 191 - 209
  • [36] Multi-modal trajectory forecasting with Multi-scale Interactions and Multi-pseudo-target Supervision
    Zhao, Cong
    Song, Andi
    Zeng, Zimu
    Ji, Yuxiong
    Du, Yuchuan
    KNOWLEDGE-BASED SYSTEMS, 2024, 296
  • [37] Multi-Task Multi-modal Semantic Hashing for Web Image Retrieval with Limited Supervision
    Xie, Liang
    Zhu, Lei
    Cheng, Zhiyong
    MULTIMEDIA MODELING (MMM 2017), PT I, 2017, 10132 : 465 - 477
  • [38] Once and for All: Self-supervised Multi-modal Co-training on One-billion Videos at Alibaba
    Huang, Lianghua
    Liu, Yu
    Zhou, Xiangzeng
    You, Ansheng
    Li, Ming
    Wang, Bin
    Zhang, Yingya
    Pan, Pan
    Xu, Yinghui
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1148 - 1156
  • [39] Learning Image Inpainting from Incomplete Images using Self-Supervision
    Yenamandra, Sriram
    Khurana, Ansh
    Jena, Rohit
    Awate, Suyash P.
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 10390 - 10397
  • [40] SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation
    Halidanmu Abudukelimu
    Jishang Chen
    Yunze Liang
    Abudukelimu Abulizi
    Alimujiang Yasen
    APPLIED INTELLIGENCE, 2024, 54: 4140 - 4152