CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0
Authors
Drossos, Konstantinos [1 ]
Lipping, Samuel [1 ]
Virtanen, Tuomas [1 ]
Affiliations
[1] Tampere Univ, Audio Res Grp, Tampere, Finland
Funding
European Research Council;
Keywords
audio captioning; dataset; Clotho;
DOI
10.1109/icassp40776.2020.9052990
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, together with a baseline method that provides initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk, with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed in post-processing. Clotho is freely available online.
Pages: 736-740
Number of pages: 5