Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

被引:0
|
作者
Tellamekala, Mani Kumar [1 ]
Valstar, Michel [1 ]
Pound, Michael [1 ]
Giesbrecht, Timo [2 ]
机构
[1] Univ Nottingham, Sch Comp Sci, Comp Vis Lab, Nottingham, England
[2] Unilever R&D Port Sunlight, Bebington, England
基金
英国工程与自然科学研究理事会;
关键词
D O I
10.1109/ICPR48806.2021.9413295
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Self-supervised learning has emerged as a candidate approach to learn semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit only either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences separately. In theory, jointly learning both these correspondences may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce 'Audio-Visual Permutative Predictive Coding' (AV-PPC), a multi-task learning framework designed to fully leverage the temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to simultaneously learn multiple intra- and cross-modal predictive coding sub-tasks. By using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task can learn higher quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further finetuning, AV-PPC visual encoder achieves 8030% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders that are learned from large amounts of labeled data.
引用
收藏
页码:9912 / 9919
页数:8
相关论文
共 50 条
  • [31] Self-supervised representation learning by predicting visual permutations
    Zhao, Qilu
    Dong, Junyu
    KNOWLEDGE-BASED SYSTEMS, 2020, 210
  • [32] DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
    Fujita, Yoto
    Bando, Yoshiaki
    Imoto, Keisuke
    Onishi, Masaki
    Yoshii, Kazuyoshi
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 2061 - 2067
  • [33] Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
    Kurobe, Akiyoshi
    Nakajima, Yoshikatsu
    Kitani, Kris
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 29970 - 29979
  • [34] Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
    Li, Yidi
    Liu, Hong
    Tang, Hao
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1456 - 1463
  • [35] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9399 - 9406
  • [36] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9 : 64346 - 64357
  • [37] SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding
    Sun, Chao
    Chen, Min
    Cheng, Jialiang
    Liang, Han
    Zhu, Chuanbo
    Chen, Jincai
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 261 - 270
  • [38] Self-Supervised Visual Representation Learning from Hierarchical Grouping
    Zhang, Xiao
    Maire, Michael
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [39] Self-Supervised Visual Representation Learning via Residual Momentum
    Pham, Trung Xuan
    Niu, Axi
    Zhang, Kang
    Jin, Tee Joshua Tian
    Hong, Ji Woo
    Yoo, Chang D.
    IEEE ACCESS, 2023, 11 : 116706 - 116720
  • [40] Dense Semantic Contrast for Self-Supervised Visual Representation Learning
    Li, Xiaoni
    Zhou, Yu
    Zhang, Yifei
    Zhang, Aoting
    Wang, Wei
    Jiang, Ning
    Wu, Haiying
    Wang, Weiping
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1368 - 1376