Self-supervised object detection from audio-visual correspondence

Cited by: 17

Authors:

Afouras, Triantafyllos ^{[1,4]}

Asano, Yuki M. ^{[2]}

Fagan, Francois ^{[3]}

Vedaldi, Andrea ^{[3]}

Metze, Florian ^{[3]}

Affiliations:

[1] Univ Oxford, Oxford, England

[2] Univ Amsterdam, Amsterdam, Netherlands

[3] Meta AI, Menlo Pk, CA USA

[4] FAIR, Oxford, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022

DOI:

10.1109/CVPR52688.2022.01032

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We tackle the problem of learning object detectors without supervision. Unlike weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localise objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. With this, we outperform previous unsupervised and weakly-supervised detectors on the tasks of object detection and sound source localisation. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects that go beyond instruments, such as airplanes and cats.

Pages: 10565-10576

Number of pages: 12

