Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Cited: 0
Authors
Zhao, Yanpeng [1 ]
Hessel, Jack [3 ]
Yu, Youngjae [3 ]
Lu, Ximing [2 ,3 ]
Zellers, Rowan [2 ]
Choi, Yejin [2 ,3 ]
Affiliations
[1] Univ Edinburgh, Inst Language Cognit & Computat, Edinburgh, Midlothian, Scotland
[2] Univ Washington, Paul G Allen Sch Comp Sci & Engn, Seattle, WA 98195 USA
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Funding
European Research Council; National Science Foundation (US);
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing paradigms for learning audio-text connections rely on parallel audio-text data, which is, however, scarce on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a trimodal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
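The pivot idea in the abstract can be sketched numerically: if audio and text encoders are each trained to embed into the same image space, a clip can be classified zero-shot by comparing its embedding to class-name text embeddings via cosine similarity, with no paired audio-text data. The sketch below is illustrative only and is not the authors' implementation; `W_audio` and `W_text` are hypothetical, untrained stand-ins for the projection heads that VIP-ANT would learn against the shared image pivot.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # dimension of the shared (image-pivot) embedding space

# Hypothetical projection heads; in VIP-ANT these would be trained so
# that audio->image and text->image embeddings land in the same space.
W_audio = rng.normal(size=(128, D))  # maps audio features to the pivot space
W_text = rng.normal(size=(32, D))    # maps text features to the pivot space

def embed(x, W):
    """Project features into the shared space and L2-normalize rows."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Zero-shot audio classification: score one clip against the
# embeddings of 5 class-name prompts via cosine similarity.
audio_clip = rng.normal(size=(1, 128))   # stand-in audio features
class_texts = rng.normal(size=(5, 32))   # stand-in class-name features

a = embed(audio_clip, W_audio)
t = embed(class_texts, W_text)
scores = a @ t.T          # cosine similarities, shape (1, 5)
pred = int(scores.argmax())  # predicted class index
```

Because both rows are unit-normalized, `scores` contains cosine similarities in [-1, 1]; with trained encoders, the argmax would pick the class whose name is closest to the clip in the shared pivot space.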
Pages: 4492-4507 (16 pages)