TRAINING AUDIO CAPTIONING MODELS WITHOUT AUDIO

Cited by: 2
Authors
Deshmukh, Soham [1 ]
Elizalde, Benjamin [1 ]
Emmanouilidou, Dimitra [1 ]
Raj, Bhiksha [2 ]
Singh, Rita [2 ]
Wang, Huaming [1 ]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
automated audio captioning; text-only training; prefix tuning; contrastive learning;
DOI: 10.1109/ICASSP48485.2024.10448115
Abstract
Automated Audio Captioning (AAC) is the task of generating natural language descriptions for a given audio stream. A typical AAC system requires manually curated training data consisting of audio segments and corresponding text caption annotations. Creating these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal embedding space of contrastively trained audio-text models such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder; during inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose injecting noise or using a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained on paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
Pages: 371-375
Page count: 5
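The train/inference swap described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `PrefixDecoder`, `inject_noise`, the embedding dimension, and the noise scale are all assumed names and values, and a plain `nn.Linear` stands in for the frozen CLAP encoders and the language-model decoder used in the actual system.

```python
import torch
import torch.nn as nn

# Assumed embedding width of the shared CLAP-style text/audio space
# (illustrative; the real CLAP dimension may differ).
EMBED_DIM = 512

class PrefixDecoder(nn.Module):
    """Toy captioning decoder conditioned on a CLAP-style embedding.

    Maps the conditioning embedding to a short prefix of hidden states;
    a real system would feed this prefix to a language model.
    """
    def __init__(self, embed_dim=EMBED_DIM, prefix_len=4, hidden=256):
        super().__init__()
        self.prefix_len, self.hidden = prefix_len, hidden
        self.project = nn.Linear(embed_dim, prefix_len * hidden)

    def forward(self, cond_embedding):
        batch = cond_embedding.shape[0]
        return self.project(cond_embedding).view(batch, self.prefix_len, self.hidden)

def inject_noise(text_embedding, noise_std=0.015):
    """Gaussian noise injection (assumed noise scale): perturbs the text
    embedding during training so the decoder tolerates the modality gap
    to audio embeddings, then re-normalizes to the unit sphere."""
    noisy = text_embedding + noise_std * torch.randn_like(text_embedding)
    return noisy / noisy.norm(dim=-1, keepdim=True)

decoder = PrefixDecoder()

# Training (text only): condition the decoder on *noisy text* embeddings.
text_emb = torch.nn.functional.normalize(torch.randn(8, EMBED_DIM), dim=-1)
prefix_train = decoder(inject_noise(text_emb))   # (8, 4, 256)

# Inference: swap in *audio* embeddings from the CLAP audio encoder, no noise.
audio_emb = torch.nn.functional.normalize(torch.randn(2, EMBED_DIM), dim=-1)
prefix_infer = decoder(audio_emb)                # (2, 4, 256)
```

The swap works only because contrastive pretraining places text and audio embeddings in the same space; noise injection (or, alternatively, a learnable adapter) compensates for the residual offset between the two modalities.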