Sound and Visual Representation Learning with Multiple Pretraining Tasks

Cited by: 3
Authors
Vasudevan, Arun Balajee [1]
Dai, Dengxin [2]
Van Gool, Luc [1, 3]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] MPI Informat, Saarbrucken, Germany
[3] Katholieke Univ Leuven, Leuven, Belgium
DOI
10.1109/CVPR52688.2022.01421
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Different self-supervised learning (SSL) tasks reveal different features from the data, and the learned feature representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) into a representation that generalizes well across all downstream tasks. For this study, we investigate binaural sounds and image data. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural sounds, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super resolution, and semantic prediction using the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single-SSL-task models and fully supervised models in downstream task performance. As a check of applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL, and DetCo by 2.06%, 3.27%, and 1.19% on VOC07 classification and by +2.83, +1.56, and +1.61 AP on COCO detection, respectively.
Pages: 14596-14606
Number of pages: 11