CLOTHO: AN AUDIO CAPTIONING DATASET

Cited by: 0

Authors:

Drossos, Konstantinos [1]

Lipping, Samuel [1]

Virtanen, Tuomas [1]

Affiliations:

[1] Tampere Univ, Audio Res Grp, Tampere, Finland

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Funding:

European Research Council;

Keywords:

audio captioning; dataset; Clotho;

DOI:

10.1109/icassp40776.2020.9052990

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online.


Pages: 736-740

Number of pages: 5

