LEARNING CONTEXTUAL TAG EMBEDDINGS FOR CROSS-MODAL ALIGNMENT OF AUDIO AND TAGS

Cited by: 3
Authors
Favory, Xavier [1]
Drossos, Konstantinos [2]
Virtanen, Tuomas [2]
Serra, Xavier [1]
Affiliations
[1] Univ Pompeu Fabra, Mus Technol Grp, Barcelona, Spain
[2] Tampere Univ, Audio Res Grp, Tampere, Finland
Keywords
representation learning; multimodal contrastive learning; audio classification
DOI
10.1109/ICASSP39728.2021.9414638
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with it do not employ text processing models capable of generalizing to tags unknown during training. In this work, we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of the AAE using a contrastive loss. We jointly optimize the AAE and MHA, and we evaluate the audio representations (i.e., the output of the encoder of the AAE) by using them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
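The alignment idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the contrastive loss, the pooling over tags, or any layer sizes, so the symmetric softmax-based loss, the mean pooling, the temperature value, and all function and parameter names below are assumptions chosen for illustration. Random vectors stand in for the AAE encoder output and the WEM tag embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_head_self_attention(tags, w_q, w_k, w_v, num_heads):
    """Contextualize word embeddings of one clip's tags.

    tags: (n_tags, d) word embeddings (stand-in for the WEM output).
    Returns one (d,) vector; mean pooling over tags is an assumption.
    """
    n, d = tags.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tags @ w_q), split(tags @ w_k), split(tags @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    out = softmax(scores, axis=-1) @ v                   # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d)           # concatenate heads
    return out.mean(axis=0)

def contrastive_loss(audio_emb, tag_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch (assumed form).

    Row i of audio_emb and row i of tag_emb are a matching pair;
    all other rows in the batch act as negatives.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                       # (batch, batch)
    idx = np.arange(len(a))
    audio_to_tag = -log_softmax(logits, axis=1)[idx, idx].mean()
    tag_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_tag + tag_to_audio)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, batch, heads = 16, 4, 4
    w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    # Each clip has a variable number of tags (3 to 5 here).
    tag_sets = [rng.normal(size=(rng.integers(3, 6), d)) for _ in range(batch)]
    tag_emb = np.stack([
        multi_head_self_attention(ts, w_q, w_k, w_v, heads) for ts in tag_sets
    ])
    audio_emb = rng.normal(size=(batch, d))  # stand-in for AAE encoder output
    print("contrastive loss:", contrastive_loss(audio_emb, tag_emb))
```

In a trainable version, the projection matrices and the audio encoder would be optimized jointly, with this loss added to the autoencoder's reconstruction loss. How the two terms are weighted is not stated in the abstract.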
Pages: 596-600
Number of pages: 5