CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0

Authors:

Drossos, Konstantinos ^{[1]}

Lipping, Samuel ^{[1]}

Virtanen, Tuomas ^{[1]}

Affiliations:

[1] Tampere Univ, Audio Res Grp, Tampere, Finland

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Funding:

European Research Council;

Keywords:

audio captioning; dataset; Clotho;

DOI:

10.1109/icassp40776.2020.9052990

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.

Pages: 736 - 740

Number of pages: 5

50 items in total

[41] M-VAD names: a dataset for video captioning with naming
Pini, Stefano
Cornia, Marcella
Bolelli, Federico
Baraldi, Lorenzo
Cucchiara, Rita
MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (10) : 14007 - 14027
[42] Sieve: Multimodal Dataset Pruning Using Image Captioning Models
Mahmoud, Anas
Elhoushi, Mostafa
Abbass, Amro
Yang, Yu
Ardalani, Newsha
Leather, Hugh
Morcos, Art S.
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 22423 - 22432
[43] M-VAD names: a dataset for video captioning with naming
Stefano Pini
Marcella Cornia
Federico Bolelli
Lorenzo Baraldi
Rita Cucchiara
Multimedia Tools and Applications, 2019, 78 : 14007 - 14027
[44] Smartphone Audio Replay Attacks Dataset
Mandalapu, Hareesh
Ramachandra, Raghavendra
Busch, Christoph
2021 9TH INTERNATIONAL WORKSHOP ON BIOMETRICS AND FORENSICS (IWBF 2021), 2021,
[45] Towards Image Captioning for the Portuguese Language: Evaluation on a Translated Dataset
Gondim, Joao
Claro, Daniela Barreiro
Souza, Marlo
ICEIS: PROCEEDINGS OF THE 24TH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS - VOL 1, 2022, : 384 - 393
[46] AUDIO SET: AN ONTOLOGY AND HUMAN-LABELED DATASET FOR AUDIO EVENTS
Gemmeke, Jort F.
Ellis, Daniel P. W.
Freedman, Dylan
Jansen, Aren
Lawrence, Wade
Moore, R. Channing
Plakal, Manoj
Ritter, Marvin
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 776 - 780
[47] Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students' behaviors in classroom scenes
Sun, Bo
Wu, Yong
Zhao, Kaijie
He, Jun
Yu, Lejun
Yan, Huanqing
Luo, Ao
NEURAL COMPUTING & APPLICATIONS, 2021, 33 (14) : 8335 - 8354
[48] Student Class Behavior Dataset: a video dataset for recognizing, detecting, and captioning students’ behaviors in classroom scenes
Bo Sun
Yong Wu
Kaijie Zhao
Jun He
Lejun Yu
Huanqing Yan
Ao Luo
Neural Computing and Applications, 2021, 33 : 8335 - 8354
[49] DIVERSITY-CONTROLLABLE AND ACCURATE AUDIO CAPTIONING BASED ON NEURAL CONDITION
Xu, Xuenan
Wu, Mengyue
Yu, Kai
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 971 - 975
[50] Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning
Xu, Xuenan
Xie, Zeyu
Wu, Mengyue
Yu, Kai
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 95 - 112
