Sound and Visual Representation Learning with Multiple Pretraining Tasks

Cited by: 3
Authors
Vasudevan, Arun Balajee [1 ]
Dai, Dengxin [2 ]
Van Gool, Luc [1 ,3 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] MPI Informat, Saarbrucken, Germany
[3] Katholieke Univ Leuven, Leuven, Belgium
DOI
10.1109/CVPR52688.2022.01421
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks. We study binaural sounds and image data. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects with binaural sounds, and temporal gap prediction. We investigate several Multi-SSL approaches and analyze downstream performance on video retrieval, spatial sound super-resolution, and semantic prediction using the OmniAudio dataset. Our experiments on binaural sound representations show that Multi-SSL via incremental learning (IL) of SSL tasks outperforms both single-SSL-task models and fully supervised models on the downstream tasks. To check applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL, and DetCo by 2.06%, 3.27%, and 1.19% on VOC07 classification, and by +2.83, +1.56, and +1.61 AP on COCO detection.
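To make the abstract's idea of "Multi-SSL via incremental learning" concrete, the following is a minimal sketch, not the authors' code: a shared audio encoder is pretrained on one self-supervised pretext task at a time (spatial alignment, temporal synchronization, temporal gap prediction), with one small head per task. All module names, layer sizes, loss choices, and the synthetic data are illustrative assumptions.

# Minimal sketch (assumed architecture, not the paper's implementation) of
# Multi-SSL pretraining via incremental learning over several pretext tasks.
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Toy binaural-audio encoder: 2-channel spectrogram -> feature vector."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):
        return self.net(x)


def make_heads(feat_dim: int = 128) -> nn.ModuleDict:
    """One small head per pretext task (output shapes are assumptions)."""
    return nn.ModuleDict({
        "spatial_alignment": nn.Linear(feat_dim, 2),  # aligned vs. channel-flipped
        "temporal_sync": nn.Linear(feat_dim, 2),      # audio/video in or out of sync
        "gap_prediction": nn.Linear(feat_dim, 1),     # regress the temporal gap
    })


def train_incrementally(encoder, heads, task_loaders, lr=1e-3):
    """Visit SSL tasks sequentially; the encoder accumulates knowledge across tasks."""
    losses = {
        "spatial_alignment": nn.CrossEntropyLoss(),
        "temporal_sync": nn.CrossEntropyLoss(),
        "gap_prediction": nn.MSELoss(),
    }
    for task, loader in task_loaders.items():
        params = list(encoder.parameters()) + list(heads[task].parameters())
        opt = torch.optim.Adam(params, lr=lr)
        for x, y in loader:  # one pass per task for brevity
            opt.zero_grad()
            out = heads[task](encoder(x))
            out = out.squeeze(-1) if task == "gap_prediction" else out
            losses[task](out, y).backward()
            opt.step()
    return encoder  # reused as the backbone for downstream tasks


if __name__ == "__main__":
    # Synthetic stand-in batches of 2-channel "spectrograms" with fake labels.
    def fake_loader(task, n_batches=2, bs=4):
        for _ in range(n_batches):
            x = torch.randn(bs, 2, 64, 64)
            y = torch.rand(bs) if task == "gap_prediction" else torch.randint(0, 2, (bs,))
            yield x, y

    enc, heads = SharedEncoder(), make_heads()
    train_incrementally(enc, heads, {t: fake_loader(t) for t in heads.keys()})

The same pattern would apply to the image-modality experiments by swapping the pretext heads for MoCov2- and DenseCL-style contrastive objectives; that substitution is not shown here.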
Pages: 14596-14606 (11 pages)
Related Papers (50 in total)
  • [31] TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
    Rahman, Tanzila
    Yang, Mengyu
    Sigal, Leonid
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021,
  • [32] Video Pretraining Advances 3D Deep Learning on Chest CT Tasks
    Ke, Alexander
    Huang, Shih-Cheng
O'Connell, Chloe
    Klimont, Michal
    Yeung, Serena
    Rajpurkar, Pranav
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 758 - 774
  • [33] Low Level Visual Feature Extraction by Learning of Multiple Tasks for Convolutional Neural Networks
    Ide, Hidenori
    Kurita, Takio
    2016 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2016, : 3620 - 3627
  • [34] Deep contrastive representation learning for supervised tasks
    Duan, Chenguang
    Jiao, Yuling
    Kang, Lican
    Yang, Jerry Zhijian
    Zhou, Fusheng
    PATTERN RECOGNITION, 2025, 161
  • [35] Gram matrix: an efficient representation of molecular conformation and learning objective for molecular pretraining
    Xiang, Wenkai
    Zhong, Feisheng
    Ni, Lin
    Zheng, Mingyue
    Li, Xutong
    Shi, Qian
    Wang, Dingyan
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (04)
  • [36] SOUND DISCRIMINATION AS A FUNCTION OF PRETRAINING CONDITIONS
    WINITZ, H
    BELLEROSE, B
    JOURNAL OF SPEECH AND HEARING RESEARCH, 1962, 5 (04): : 340 - 348
  • [37] Mutual Contrastive Learning for Visual Representation Learning
    Yang, Chuanguang
    An, Zhulin
    Cai, Linhang
    Xu, Yongjun
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3045 - 3053
  • [38] Notation as visual representation of sound-based music
    Skold, Mattias
    JOURNAL OF NEW MUSIC RESEARCH, 2022, 51 (2-3) : 186 - 202
  • [39] SYNESTHESIA IN GRAPHIC NOTATION: VISUAL LANGUAGES FOR THE REPRESENTATION OF SOUND
    Buj Corral, Marina
    CUADERNOS DE MUSICA ARTES VISUALES Y ARTES ESCENICAS, 2019, 14 (01): : 45 - 64
  • [40] Integration of Multiple Visual Tasks in a Robotic System
    Hernandez, Daniel
    Cabrera, Jorge
    Naranjo, Angel
    Dominguez, Antonio
    Isern, Josep
    CISCI 2007: 6TA CONFERENCIA IBEROAMERICANA EN SISTEMAS, CIBERNETICA E INFORMATICA, MEMORIAS, VOL I, 2007, : 183 - 188