Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Cited: 0
Authors
Zhao, Yanpeng [1 ]
Hessel, Jack [3 ]
Yu, Youngjae [3 ]
Lu, Ximing [2 ,3 ]
Zellers, Rowan [2 ]
Choi, Yejin [2 ,3 ]
Affiliations
[1] Univ Edinburgh, Inst Language Cognit & Computat, Edinburgh, Midlothian, Scotland
[2] Univ Washington, Paul G Allen Sch Comp Sci & Engn, Seattle, WA 98195 USA
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Funding
European Research Council; National Science Foundation (US);
Keywords
DOI
Not available
CLC classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing paradigms for learning audio-text connections rely on parallel audio-text data, which is, however, scarce on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a trimodal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
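The pivot idea in the abstract can be sketched numerically: if audio and text encoders are each trained to embed into the same image space, a clip can be classified zero-shot by comparing its embedding to class-name text embeddings via cosine similarity, with no paired audio-text data. The sketch below is illustrative only and is not the authors' implementation; `W_audio` and `W_text` are hypothetical, untrained stand-ins for the projection heads that VIP-ANT would learn against the shared image pivot.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # dimension of the shared (image-pivot) embedding space

# Hypothetical projection heads; in VIP-ANT these would be trained so
# that audio->image and text->image embeddings land in the same space.
W_audio = rng.normal(size=(128, D))  # maps audio features to the pivot space
W_text = rng.normal(size=(32, D))    # maps text features to the pivot space

def embed(x, W):
    """Project features into the shared space and L2-normalize rows."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Zero-shot audio classification: score one clip against the
# embeddings of 5 class-name prompts via cosine similarity.
audio_clip = rng.normal(size=(1, 128))   # stand-in audio features
class_texts = rng.normal(size=(5, 32))   # stand-in class-name features

a = embed(audio_clip, W_audio)
t = embed(class_texts, W_text)
scores = a @ t.T          # cosine similarities, shape (1, 5)
pred = int(scores.argmax())  # predicted class index
```

Because both rows are unit-normalized, `scores` contains cosine similarities in [-1, 1]; with trained encoders, the argmax would pick the class whose name is closest to the clip in the shared pivot space.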
Pages: 4492-4507 (16 pages)