Self-supervised object detection from audio-visual correspondence

Cited by: 17
Authors
Afouras, Triantafyllos [1, 4]
Asano, Yuki M. [2]
Fagan, Francois [3]
Vedaldi, Andrea [3]
Metze, Florian [3]
Affiliations
[1] Univ Oxford, Oxford, England
[2] Univ Amsterdam, Amsterdam, Netherlands
[3] Meta AI, Menlo Park, CA, USA
[4] FAIR, Oxford, England
DOI
10.1109/CVPR52688.2022.01032
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and that our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
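The abstract mentions a contrastive objective over audio-visual data but does not give its form. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss, a common choice for audio-visual correspondence. It is an assumption, not the authors' objective; the function names `l2_normalize` and `info_nce` and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, audio, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    visual[i] and audio[i] are assumed to come from the same clip (a positive
    pair); every other pairing in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    a = [l2_normalize(x) for x in audio]
    n = len(v)

    # Temperature-scaled cosine similarity between every frame and every audio clip.
    sim = [
        [sum(p * q for p, q in zip(v[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[target]

    # Visual-to-audio direction uses rows; audio-to-visual uses columns.
    v2a = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    a2v = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2a + a2v)


# Toy batch: two clips with orthogonal embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
matched_audio = [[1.0, 0.0], [0.0, 1.0]]
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(frames, matched_audio)
loss_swapped = info_nce(frames, swapped_audio)
```

In this toy batch the loss is small when each frame is paired with its own audio and large when the pairs are swapped, which is the behaviour a correspondence objective relies on.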
Pages: 10565-10576
Page count: 12