CLOTHO: AN AUDIO CAPTIONING DATASET

Authors
Drossos, Konstantinos [1]
Lipping, Samuel [1]
Virtanen, Tuomas [1]
Affiliations
[1] Tampere Univ, Audio Res Grp, Tampere, Finland
Funding
European Research Council;
Keywords
audio captioning; dataset; Clotho;
DOI
10.1109/icassp40776.2020.9052990
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Audio captioning is the novel task of describing general audio content in free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e., the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of 8 to 20 words in length, together with a baseline method that provides initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.
Pages: 736-740
Page count: 5