TRAINING AUDIO CAPTIONING MODELS WITHOUT AUDIO

Cited by: 2
Authors
Deshmukh, Soham [1 ]
Elizalde, Benjamin [1 ]
Emmanouilidou, Dimitra [1 ]
Raj, Bhiksha [2 ]
Singh, Rita [2 ]
Wang, Huaming [1 ]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
automated audio captioning; text-only training; prefix tuning; contrastive learning
DOI
10.1109/ICASSP48485.2024.10448115
Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
Pages: 371-375
Page count: 5
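
The abstract describes a concrete recipe: train a caption decoder on frozen CLAP text embeddings with noise injected to bridge the text-audio modality gap, then swap in the frozen CLAP audio encoder at inference. Below is a minimal PyTorch sketch of that idea. All class names, dimensions, and the noise scale are illustrative assumptions; the stub encoders stand in for real pretrained CLAP encoders, and the toy GRU decoder stands in for the paper's prefix-tuned language-model decoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512      # assumed CLAP joint-embedding dimensionality
NOISE_STD = 0.015    # assumed noise scale; the paper tunes this
VOCAB = 1000         # toy vocabulary size

class FrozenClapTextEncoder(nn.Module):
    """Stand-in for the pretrained, frozen CLAP text encoder."""
    def __init__(self):
        super().__init__()
        self.emb = nn.EmbeddingBag(VOCAB, EMBED_DIM)
    def forward(self, token_ids):
        # Real CLAP encoders return L2-normalized joint-space embeddings.
        return F.normalize(self.emb(token_ids), dim=-1)

class FrozenClapAudioEncoder(nn.Module):
    """Stand-in for the pretrained, frozen CLAP audio encoder
    (used only at inference, never during training)."""
    def __init__(self, n_mels=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, EMBED_DIM)
    def forward(self, feats):
        return F.normalize(self.proj(feats.mean(dim=1)), dim=-1)

class CaptionDecoder(nn.Module):
    """Toy autoregressive decoder. Conditioning via the GRU hidden state
    is a simplification of the paper's prefix-tuning conditioning."""
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, EMBED_DIM)
        self.gru = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.out = nn.Linear(EMBED_DIM, VOCAB)
    def forward(self, prefix, tokens):
        hidden, _ = self.gru(self.tok(tokens), prefix.unsqueeze(0))
        return self.out(hidden)

text_enc  = FrozenClapTextEncoder().eval()
audio_enc = FrozenClapAudioEncoder().eval()
decoder   = CaptionDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

# --- Training step: text only, no audio involved ---
caption = torch.randint(0, VOCAB, (8, 12))            # fake token ids (B, T)
with torch.no_grad():
    e_text = text_enc(caption)                        # CLAP text embedding
e_noisy = e_text + NOISE_STD * torch.randn_like(e_text)  # bridge modality gap
logits = decoder(e_noisy, caption[:, :-1])            # teacher forcing
loss = F.cross_entropy(logits.reshape(-1, VOCAB), caption[:, 1:].reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()

# --- Inference: swap in the audio encoder; decoder is unchanged ---
mel = torch.randn(1, 100, 64)                         # fake log-mel features
with torch.no_grad():
    e_audio = audio_enc(mel)    # lands near text embeddings in CLAP space
    # greedy or beam decoding conditioned on e_audio would go here

The design point the sketch captures is that only the decoder is ever updated, and it never sees audio during training; the injected Gaussian noise makes it tolerant to the offset between CLAP's text and audio embeddings. The paper's alternative, a learnable adapter, would replace the noise-injection line with a small trainable mapping applied to the embedding.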