Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

Cited by: 0

Authors:

Tellamekala, Mani Kumar ^{[1]}

Valstar, Michel ^{[1]}

Pound, Michael ^{[1]}

Giesbrecht, Timo ^{[2]}

Affiliations:

[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England

[2] Unilever R&D Port Sunlight, Bebington, England

Source:

2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

DOI:

10.1109/ICPR48806.2021.9413295

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both of these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.


Pages: 9912-9919

Page count: 8

