TRAINING AUDIO CAPTIONING MODELS WITHOUT AUDIO

Cited by: 2
Authors
Deshmukh, Soham [1 ]
Elizalde, Benjamin [1 ]
Emmanouilidou, Dimitra [1 ]
Raj, Bhiksha [2 ]
Singh, Rita [2 ]
Wang, Huaming [1 ]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
automated audio captioning; text-only training; prefix tuning; contrastive learning;
DOI
10.1109/ICASSP48485.2024.10448115
Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
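The abstract's noise-injection idea can be illustrated with a minimal sketch. This is not the authors' implementation: `inject_noise`, the `decoder` stand-in, and the `noise_std` value are all illustrative assumptions; the paper tunes its own noise scale and uses real CLAP encoders and a caption decoder.

```python
import numpy as np

def inject_noise(text_emb, noise_std=0.015, rng=None):
    """Perturb a unit-norm text embedding with Gaussian noise and
    re-normalize -- a sketch of noise injection used to bridge the
    text/audio modality gap during text-only training.
    (noise_std is an illustrative value, not the paper's setting.)"""
    rng = np.random.default_rng() if rng is None else rng
    noisy = text_emb + rng.normal(0.0, noise_std, size=text_emb.shape)
    return noisy / np.linalg.norm(noisy)

def train_step(caption_emb, decoder):
    # Training: the decoder is conditioned on a *noisy* CLAP text
    # embedding of the caption (no audio involved).
    return decoder(inject_noise(caption_emb))

def infer(audio_emb, decoder):
    # Inference: the CLAP audio embedding is dropped in unchanged --
    # the text encoder (and the noise) are gone.
    return decoder(audio_emb / np.linalg.norm(audio_emb))
```

Because CLAP embeds text and audio in a shared space, a decoder trained on slightly perturbed text embeddings can tolerate the residual text-to-audio offset at inference time; the noise acts as a cheap surrogate for that modality gap.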
Pages: 371-375
Page count: 5
Related Papers
50 records
  • [1] DIVERSE AUDIO CAPTIONING VIA ADVERSARIAL TRAINING
    Mei, Xinhao
    Liu, Xubo
    Sun, Jianyuan
    Plumbley, Mark D.
    Wang, Wenwu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8882 - 8886
  • [2] AUDIO DIFFERENCE LEARNING FOR AUDIO CAPTIONING
    Komatsu, Tatsuya
    Fujita, Yusuke
    Takeda, Kazuya
    Toda, Tomoki
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1456 - 1460
  • [3] Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
    Liu, Jizhong
    Li, Gang
    Zhang, Junbo
    Dinkel, Heinrich
    Wang, Yongqing
    Yan, Zhiyong
    Wang, Yujun
    Wang, Bin
    INTERSPEECH 2024, 2024, : 1135 - 1139
  • [4] Audio Captioning Based on Combined Audio and Semantic Embeddings
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    2020 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM 2020), 2020, : 41 - 48
  • [5] Using various pre-trained models for audio feature extraction in automated audio captioning
    Won, Hyejin
    Kim, Baekseung
    Kwak, Il-Youp
    Lim, Changwon
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 231
  • [6] CLOTHO: AN AUDIO CAPTIONING DATASET
    Drossos, Konstantinos
    Lipping, Samuel
    Virtanen, Tuomas
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 736 - 740
  • [7] Graph Attention for Automated Audio Captioning
    Xiao, Feiyang
    Guan, Jian
    Zhu, Qiaoxi
    Wang, Wenwu
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 413 - 417
  • [8] Automated Audio Captioning With Topic Modeling
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    IEEE ACCESS, 2023, 11 : 4983 - 4991
  • [9] Joint speech recognition and audio captioning
    Carnegie Mellon University, United States
    arXiv
  • [10] JOINT SPEECH RECOGNITION AND AUDIO CAPTIONING
    Narisetty, Chaitanya
    Tsunoo, Emiru
    Chang, Xuankai
    Kashiwagi, Yosuke
    Hentschel, Michael
    Watanabe, Shinji
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7892 - 7896