Sound and Visual Representation Learning with Multiple Pretraining Tasks

Cited by: 3
Authors:
Vasudevan, Arun Balajee [1 ]
Dai, Dengxin [2 ]
Van Gool, Luc [1 ,3 ]
Affiliations:
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] MPI Informat, Saarbrucken, Germany
[3] Katholieke Univ Leuven, Leuven, Belgium
DOI: 10.1109/CVPR52688.2022.01421
CLC number: TP18 [Artificial Intelligence Theory];
Subject classification codes: 081104; 0812; 0835; 1405
Abstract:
Different self-supervised learning (SSL) tasks reveal different features of the data, and the learned representations can perform differently on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) so that the resulting representation generalizes well across all downstream tasks. For this study, we investigate binaural sounds and image data. For binaural sounds, we propose three SSL tasks: spatial alignment, temporal synchronization of foreground objects and binaural sounds, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms both single-SSL-task models and fully supervised models on downstream tasks. As a check of applicability to other modalities, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL and DetCo by 2.06%, 3.27% and 1.19% on VOC07 classification, and by +2.83, +1.56 and +1.61 AP on COCO detection.
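The abstract's core recipe, training a shared encoder on several SSL tasks one after another (incremental learning), with each task attaching its own head, can be sketched as follows. This is a hedged toy illustration: the linear encoder, heads, and quadratic losses are stand-ins invented for this sketch, not the paper's architecture or objectives; only the three binaural-sound task names come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4)) * 0.1      # shared encoder (toy linear map)

def task_loss_and_grads(X, W, head):
    """Toy SSL objective: drive head outputs toward zero."""
    F = X @ W                           # shared features
    out = F @ head                      # task-specific head output
    n = out.size
    loss = float(np.mean(out ** 2))
    g_head = 2.0 * F.T @ out / n        # dL/dhead
    g_W = 2.0 * X.T @ (out @ head.T) / n  # dL/dW via chain rule
    return loss, g_W, g_head

# Incremental schedule: one SSL task at a time, shared encoder refined throughout.
tasks = ["spatial_alignment", "temporal_sync", "gap_prediction"]
heads, history = {}, {}
for name in tasks:
    head = rng.normal(size=(4, 2)) * 0.1  # fresh head for this task
    losses = []
    for _ in range(100):                  # gradient steps on this task only
        X = rng.normal(size=(16, 8))
        loss, g_W, g_head = task_loss_and_grads(X, W, head)
        W -= 0.05 * g_W                   # encoder keeps learning across tasks
        head -= 0.05 * g_head
        losses.append(loss)
    heads[name], history[name] = head, losses

for name in tasks:
    print(name, history[name][0], "->", history[name][-1])
```

The design point the sketch captures is that each new task sees (and further shapes) the encoder produced by the previous tasks, rather than all task losses being summed at once.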
Pages: 14596-14606
Page count: 11