Sound and Visual Representation Learning with Multiple Pretraining Tasks

Cited by: 3
Authors
Vasudevan, Arun Balajee [1]
Dai, Dengxin [2]
Van Gool, Luc [1, 3]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] MPI Informat, Saarbrucken, Germany
[3] Katholieke Univ Leuven, Leuven, Belgium
DOI
10.1109/CVPR52688.2022.01421
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well to all downstream tasks. For this study, we investigate binaural sounds and image data. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural sounds, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction using the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models on the downstream tasks. To check applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses the recent methods MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification and by +2.83, +1.56 and +1.61 AP on COCO detection.
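The abstract does not say how the incremental learning of SSL tasks is carried out. As a rough illustration only, the sketch below assumes the simplest reading: the pretext tasks are trained one after another, and each stage starts from the shared encoder weights left by the previous stage. The names `quadratic_task` and `incremental_pretrain`, the toy quadratic losses, and the step count and learning rate are all hypothetical stand-ins, not the authors' tasks or settings.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: a pretext task is represented only by the gradient
# of its loss with respect to the shared encoder parameters.
GradFn = Callable[[Sequence[float]], List[float]]


def quadratic_task(target: Sequence[float]) -> GradFn:
    """Toy task with loss 0.5 * ||w - target||^2, whose gradient is w - target."""
    def grad(w: Sequence[float]) -> List[float]:
        return [wi - ti for wi, ti in zip(w, target)]
    return grad


def incremental_pretrain(
    w0: Sequence[float],
    tasks: Sequence[GradFn],
    steps_per_task: int = 20,
    lr: float = 0.1,
) -> Tuple[List[float], List[List[float]]]:
    """Run gradient descent on each task in turn.

    The parameters are never reset between tasks, so every stage is
    initialised from the result of the previous one (assumed schedule).
    """
    w = list(w0)
    history: List[List[float]] = []
    for grad_fn in tasks:
        for _ in range(steps_per_task):
            g = grad_fn(w)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append(list(w))  # snapshot of the encoder after this stage
    return w, history


# Three toy tasks that each pull the shared parameters toward a different point.
tasks = [
    quadratic_task([1.0, 0.0]),
    quadratic_task([0.0, 1.0]),
    quadratic_task([1.0, 1.0]),
]
final_w, history = incremental_pretrain([0.0, 0.0], tasks)
print(history)
```

Because the stages share one parameter vector, the snapshot after each stage still carries a trace of the earlier tasks. A real system would replace the toy gradients with actual pretext losses and would likely need a mechanism to limit forgetting, which this sketch does not attempt.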
Pages: 14596-14606
Page count: 11