CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0
Authors
Drossos, Konstantinos [1]
Lipping, Samuel [1]
Virtanen, Tuomas [1]
Affiliations
[1] Tampere Univ, Audio Res Grp, Tampere, Finland
Funding
European Research Council
Keywords
audio captioning; dataset; Clotho
DOI
10.1109/icassp40776.2020.9052990
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.
Pages: 736-740
Number of pages: 5