Labelling unlabelled videos from scratch with multi-modal self-supervision

Authors
Asano, Yuki M. [1]
Patrick, Mandela [1, 2]
Rupprecht, Christian [1]
Vedaldi, Andrea [1, 2]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Facebook AI Res, Menlo Pk, CA USA
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A large part of the current success of deep learning lies in the effectiveness of data, or more precisely, labelled data. Yet labelling a dataset with human annotation continues to carry high costs, especially for videos. While recent methods in the image domain have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing in the video domain, where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
Pages: 12