CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0
Authors
Drossos, Konstantinos [1 ]
Lipping, Samuel [1 ]
Virtanen, Tuomas [1 ]
Affiliations
[1] Tampere Univ, Audio Res Grp, Tampere, Finland
Funding
European Research Council
Keywords
audio captioning; dataset; Clotho;
DOI
10.1109/icassp40776.2020.9052990
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Code
070206; 082403
Abstract
Audio captioning is the novel task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds in duration and 24 905 captions of eight to 20 words in length, together with a baseline method to provide initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed through post-processing. Clotho is freely available online(1).
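As a rough illustration of the dataset layout described in the abstract, the sketch below loads Clotho-style caption metadata and checks the stated eight-to-20-word caption length constraint. The file name clotho_captions_development.csv and the column layout (file_name, caption_1 ... caption_5) are assumptions made for illustration only, not details given in this record; adjust them to the files actually distributed with the dataset.

# Minimal sketch: load Clotho-style caption metadata and verify the
# caption-length constraint stated in the abstract (8 to 20 words).
# File name and column names are assumptions, not taken from the paper.
import csv

def load_captions(csv_path="clotho_captions_development.csv"):
    """Return a dict mapping audio file name -> list of its captions."""
    captions = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            captions[row["file_name"]] = [
                row[f"caption_{i}"] for i in range(1, 6) if row.get(f"caption_{i}")
            ]
    return captions

def check_lengths(captions, min_words=8, max_words=20):
    """Flag any caption whose word count falls outside the stated range."""
    offenders = []
    for file_name, caps in captions.items():
        for cap in caps:
            n = len(cap.split())
            if not (min_words <= n <= max_words):
                offenders.append((file_name, n, cap))
    return offenders

if __name__ == "__main__":
    caps = load_captions()
    print(f"{len(caps)} audio clips, {sum(len(c) for c in caps.values())} captions")
    print(f"{len(check_lengths(caps))} captions outside the 8-20 word range")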
Pages: 736 - 740
Number of pages: 5
Related Papers
50 records in total
  • [1] Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
    Lipping, Samuel
    Sudarsanam, Parthasaarathy
    Drossos, Konstantinos
    Virtanen, Tuomas
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 1140 - 1144
  • [2] AUDIO DIFFERENCE LEARNING FOR AUDIO CAPTIONING
    Komatsu, Tatsuya
    Fujita, Yusuke
    Takeda, Kazuya
    Toda, Tomoki
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1456 - 1460
  • [3] WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
    Mei, Xinhao
    Meng, Chutong
    Liu, Haohe
    Kong, Qiuqiang
    Ko, Tom
    Zhao, Chengqi
    Plumbley, Mark D.
    Zou, Yuexian
    Wang, Wenwu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3339 - 3354
  • [4] TRAINING AUDIO CAPTIONING MODELS WITHOUT AUDIO
    Deshmukh, Soham
    Elizalde, Benjamin
    Emmanouilidou, Dimitra
    Raj, Bhiksha
    Singh, Rita
    Wang, Huaming
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 371 - 375
  • [5] Audio Captioning Based on Combined Audio and Semantic Embeddings
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    2020 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM 2020), 2020, : 41 - 48
  • [6] Graph Attention for Automated Audio Captioning
    Xiao, Feiyang
    Guan, Jian
    Zhu, Qiaoxi
    Wang, Wenwu
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 413 - 417
  • [7] Automated Audio Captioning With Topic Modeling
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    IEEE ACCESS, 2023, 11 : 4983 - 4991
  • [8] MEMECAP: A Dataset for Captioning and Interpreting Memes
    Hwang, EunJeong
    Shwartz, Vered
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1433 - 1445
  • [9] Joint speech recognition and audio captioning
    Carnegie Mellon University, United States
    arXiv preprint
  • [10] JOINT SPEECH RECOGNITION AND AUDIO CAPTIONING
    Narisetty, Chaitanya
    Tsunoo, Emiru
    Chang, Xuankai
    Kashiwagi, Yosuke
    Hentschel, Michael
    Watanabe, Shinji
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7892 - 7896