A Progressive Learning Approach for Sound Event Detection with Temporal and Spectral Features Fusion

Authors
Zhong, Yilin [1]
Fang, Zhaoer [1]
Wang, Jie [1]
Fan, Bo [1]
Peng, BangHuang [1]
Affiliation
[1] BYD Auto Industry Co., Ltd., Automotive Engineering Research Institute, Shenzhen, People's Republic of China
Keywords
sound event detection; progressive learning; self-supervised learning
DOI
10.1007/978-981-97-5594-3_18
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Sound Event Detection (SED) has wide applications in real-world systems, including automatic surveillance, smart home devices, and intelligent automobiles. Recent SED works have achieved significant performance gains by fine-tuning pre-trained frame-wise audio tagging (AT) models, bridging the gap between the AT and SED tasks, but a common limitation is their exclusive reliance on spectral input features, which makes precise temporal localization of sound events challenging. To address this issue, we propose a novel Temporal Mask Model (TMM) that extracts temporal features, integrated with a Bidirectional Encoder representation from Audio Transformers and CNN (BEATs-CNN) framework that extracts spectral features. The two types of features are fused with a progressive learning strategy and subsequently fed into a Bidirectional Gated Recurrent Unit (Bi-GRU) to generate predictions. Through extensive experiments, we demonstrate that our approach surpasses the reported State-Of-The-Art (SOTA) model on Polyphonic Sound Detection Score scenario 1 (PSDS1) and achieves a comparable result on Polyphonic Sound Detection Score scenario 2 (PSDS2) for DCASE Challenge Task 4.
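As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below shows how frame-aligned temporal and spectral feature streams could be fused and passed through a Bi-GRU to produce frame-wise event probabilities. This is a minimal sketch under stated assumptions, not the authors' implementation: the module name FusionSEDHead, the additive fusion, and all dimensions are illustrative placeholders, and the TMM and BEATs-CNN encoders as well as the progressive learning schedule are omitted.

```python
# Minimal sketch (assumptions only): fuse a temporal feature stream (standing in
# for the paper's TMM output) with a spectral stream (standing in for BEATs-CNN
# output) and decode frame-wise event probabilities with a Bi-GRU.
import torch
import torch.nn as nn


class FusionSEDHead(nn.Module):
    def __init__(self, temporal_dim=256, spectral_dim=768, hidden_dim=256, num_classes=10):
        super().__init__()
        # Project both feature streams to a common dimension before fusion.
        self.temporal_proj = nn.Linear(temporal_dim, hidden_dim)
        self.spectral_proj = nn.Linear(spectral_dim, hidden_dim)
        # Bidirectional GRU over the fused frame-wise features.
        self.bigru = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Frame-wise (strong) predictions: one sigmoid score per class per frame.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, temporal_feat, spectral_feat):
        # temporal_feat: (batch, frames, temporal_dim); spectral_feat: (batch, frames, spectral_dim)
        fused = self.temporal_proj(temporal_feat) + self.spectral_proj(spectral_feat)
        out, _ = self.bigru(fused)
        return torch.sigmoid(self.classifier(out))


if __name__ == "__main__":
    # Dummy frame-aligned features for a two-clip batch of 625 frames each.
    head = FusionSEDHead()
    t_feat = torch.randn(2, 625, 256)
    s_feat = torch.randn(2, 625, 768)
    print(head(t_feat, s_feat).shape)  # torch.Size([2, 625, 10])
```

In an actual system the two branches would be trained in stages according to the progressive learning strategy; here both inputs are random tensors purely to show the fusion and decoding shapes.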
Pages: 207-218 (12 pages)