CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0
Authors
Drossos, Konstantinos [1 ]
Lipping, Samuel [1 ]
Virtanen, Tuomas [1 ]
Affiliations
[1] Tampere Univ, Audio Res Grp, Tampere, Finland
Funding
European Research Council
Keywords
audio captioning; dataset; Clotho;
DOI
10.1109/icassp40776.2020.9052990
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Code
070206; 082403
Abstract
Audio captioning is the novel task of describing general audio content using free text. It is an intermodal translation task (not speech-to-text), in which a system accepts an audio signal as input and outputs a textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds in duration and 24 905 captions of eight to 20 words in length, together with a baseline method to provide initial results. Clotho is built with a focus on audio content and caption diversity, and the data splits do not hamper the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk with annotators from English-speaking countries. Unique words, named entities, and speech transcription are removed through post-processing. Clotho is freely available online(1).
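As a rough illustration of the dataset layout described in the abstract, the sketch below loads Clotho-style caption metadata and checks the stated eight-to-20-word caption length constraint. The file name clotho_captions_development.csv and the column layout (file_name, caption_1 ... caption_5) are assumptions made for illustration only, not details given in this record; adjust them to the files actually distributed with the dataset.

# Minimal sketch: load Clotho-style caption metadata and verify the
# caption-length constraint stated in the abstract (8 to 20 words).
# File name and column names are assumptions, not taken from the paper.
import csv

def load_captions(csv_path="clotho_captions_development.csv"):
    """Return a dict mapping audio file name -> list of its captions."""
    captions = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            captions[row["file_name"]] = [
                row[f"caption_{i}"] for i in range(1, 6) if row.get(f"caption_{i}")
            ]
    return captions

def check_lengths(captions, min_words=8, max_words=20):
    """Flag any caption whose word count falls outside the stated range."""
    offenders = []
    for file_name, caps in captions.items():
        for cap in caps:
            n = len(cap.split())
            if not (min_words <= n <= max_words):
                offenders.append((file_name, n, cap))
    return offenders

if __name__ == "__main__":
    caps = load_captions()
    print(f"{len(caps)} audio clips, {sum(len(c) for c in caps.values())} captions")
    print(f"{len(check_lengths(caps))} captions outside the 8-20 word range")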
Pages: 736 - 740
Number of pages: 5
Related Papers
50 records in total
  • [1] Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
    Lipping, Samuel
    Sudarsanam, Parthasaarathy
    Drossos, Konstantinos
    Virtanen, Tuomas
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 1140 - 1144
  • [2] AUDIO DIFFERENCE LEARNING FOR AUDIO CAPTIONING
    Komatsu, Tatsuya
    Fujita, Yusuke
    Takeda, Kazuya
    Toda, Tomoki
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 1456 - 1460
  • [3] WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
    Mei, Xinhao
    Meng, Chutong
    Liu, Haohe
    Kong, Qiuqiang
    Ko, Tom
    Zhao, Chengqi
    Plumbley, Mark D.
    Zou, Yuexian
    Wang, Wenwu
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3339 - 3354
  • [4] TRAINING AUDIO CAPTIONING MODELS WITHOUT AUDIO
    Deshmukh, Soham
    Elizalde, Benjamin
    Emmanouilidou, Dimitra
    Raj, Bhiksha
    Singh, Rita
    Wang, Huaming
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 371 - 375
  • [5] Audio Captioning Based on Combined Audio and Semantic Embeddings
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    2020 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM 2020), 2020, : 41 - 48
  • [6] Graph Attention for Automated Audio Captioning
    Xiao, Feiyang
    Guan, Jian
    Zhu, Qiaoxi
    Wang, Wenwu
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 413 - 417
  • [7] Automated Audio Captioning With Topic Modeling
    Eren, Aysegul Ozkaya
    Sert, Mustafa
    IEEE ACCESS, 2023, 11 : 4983 - 4991
  • [8] MEMECAP: A Dataset for Captioning and Interpreting Memes
    Hwang, EunJeong
    Shwartz, Vered
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1433 - 1445
  • [9] Joint speech recognition and audio captioning
    Carnegie Mellon University, United States
    arXiv preprint
  • [10] JOINT SPEECH RECOGNITION AND AUDIO CAPTIONING
    Narisetty, Chaitanya
    Tsunoo, Emiru
    Chang, Xuankai
    Kashiwagi, Yosuke
    Hentschel, Michael
    Watanabe, Shinji
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7892 - 7896