Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Cited by: 0
Authors
Zhao, Yanpeng [1 ]
Hessel, Jack [3 ]
Yu, Youngjae [3 ]
Lu, Ximing [2 ,3 ]
Zellers, Rowan [2 ]
Choi, Yejin [2 ,3 ]
Affiliations
[1] Univ Edinburgh, Inst Language Cognit & Computat, Edinburgh, Midlothian, Scotland
[2] Univ Washington, Paul G Allen Sch Comp Sci & Engn, Seattle, WA 98195 USA
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Funding
European Research Council; US National Science Foundation;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing paradigms for learning audio-text connections rely on parallel audio-text data, which is, however, scarce on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase zero-shot audio classification accuracy by 8% on US8K. However, our empirical scaling experiments suggest that matching human parity on some zero-shot tasks would require about 2^21 ≈ 2M supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.
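The pivot idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual model: the encoders, embedding dimension, and toy vectors below are hypothetical stand-ins. It only shows the inference-time consequence of the pivot: if an audio encoder and a text encoder are each aligned to the same image embedding space, audio clips can be classified zero-shot by cosine similarity against label texts, even though no audio-text pairs were seen in training.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so the dot product
    # equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(audio_emb, label_text_embs):
    """Rank candidate label texts for one audio clip by cosine similarity.

    Both inputs are assumed to live in the SAME shared, image-pivoted
    space: the audio encoder was trained against images, and the text
    encoder was trained against images, so audio-text similarity is
    meaningful despite no parallel audio-text supervision.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(label_text_embs)
    sims = t @ a  # cosine similarity between the clip and each label text
    return int(np.argmax(sims)), sims

# Toy 4-d stand-ins for real encoder outputs (hypothetical values).
audio_emb = np.array([0.9, 0.1, 0.0, 0.2])      # e.g. a dog-bark clip
label_text_embs = np.array([
    [1.0, 0.0, 0.0, 0.1],                       # "a dog barking"
    [0.0, 1.0, 0.2, 0.0],                       # "rain falling"
    [0.1, 0.0, 1.0, 0.0],                       # "a car engine"
])
pred, sims = zero_shot_classify(audio_emb, label_text_embs)
print(pred)  # -> 0 (the audio embedding is closest to the first label)
```

In this toy setup the clip's embedding is closest to the first label text, so it is assigned that class; with real encoders, the quality of such predictions depends entirely on how well each modality is aligned to the shared image pivot.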
Pages: 4492-4507
Page count: 16
Related Papers
50 in total
  • [1] Audio-visual integration through the parallel visual pathways
    Kaposvari, Peter
    Csete, Gergo
    Bognar, Anna
    Csibri, Peter
    Toth, Eszter
    Szabo, Nikoletta
    Vecsei, Laszlo
    Sary, Gyula
    Kincses, Zsigmond Tamas
    BRAIN RESEARCH, 2015, 1624 : 71 - 77
  • [2] Connecting Knowledge for Text Construction through the Use of Graphic Organizers
    Camila Reyes, Elsy
    COLOMBIAN APPLIED LINGUISTICS JOURNAL, 2011, 13 (01) : 7 - 19
  • [3] Connecting the dots between data management and research integrity
    Moore, Scott
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2018, 256
  • [4] Connecting the Dots: Explaining Relationships Between Unconnected Entities in a Knowledge Graph
    Aggarwal, Nitish
    Bhatia, Sumit
    Misra, Vinith
    SEMANTIC WEB, ESWC 2016, 2016, 9989 : 35 - 39
  • [5] Editorial: Connecting the Dots Between Good Data and Good Decisions
    Fagan, Jody Condit
    JOURNAL OF WEB LIBRARIANSHIP, 2012, 6 (02) : 87 - 93
  • [6] Educational Data Virtual Lab: Connecting the Dots Between Data Visualization and Analysis
    Lopez-Pernas, Sonsoles
    Munoz-Arcentales, Andres
    Aparicio, Carlos
    Barra, Enrique
    Gordillo, Aldo
    Salvachua, Joaquin
    Quemada, Juan
    IEEE COMPUTER GRAPHICS AND APPLICATIONS, 2022, 42 (05) : 76 - 83
  • [7] Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data
    Deng, James J.
    Leung, Clement H. C.
    Li, Yuanxi
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS, ICCSA 2021, PT III, 2021, 12951 : 552 - 563
  • [8] Connecting the Dots: Bridging Innovation Management and Technology Transfer Through ISO 56002
    Barboza, Bertiene Maria Lack
    Kovaleski, Joao Luiz
    Zola, Fernanda Cavichioli
    Chiroli, Daiane Maria de Genaro
    JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2024, 23 (06)
  • [9] Connecting Knowledge to Data Through Transformations in KnowID: System Description
    Fillottrani, Pablo R.
    Jamieson, Stephan
    Keet, C. Maria
    KUNSTLICHE INTELLIGENZ, 2020, 34 (03): : 373 - 379
  • [10] Coupled Knowledge Transfer for Visual Data Recognition
    Meng, Min
    Lan, Mengcheng
    Yu, Jun
    Wu, Jigang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (05) : 1776 - 1789