Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Cited by: 1
Authors
Jiao, Xiaolin [1 ]
Chen, Yaqi [2 ]
Qu, Dan [2 ]
Yang, Xukui [2 ]
Affiliations
[1] Zhengzhou Univ, Sch Cyber Sci & Engn, Zhengzhou 450001, Peoples R China
[2] Informat Engn Univ, Sch Informat Syst Engn, Zhengzhou 450001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
end-to-end neural diarization (EEND); blueprint separable convolution (BSConv); multi-scale feature aggregation (MFA); SPEAKER DIARIZATION; SEPARATION;
DOI
10.3390/electronics12194118
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
At present, a prevalent approach to speaker diarization is clustering based on speaker embeddings. However, this method encounters two primary issues. Firstly, it cannot directly minimize the diarization error during training; secondly, most clustering-based methods struggle to handle speaker overlap in audio. A viable approach for addressing these issues is end-to-end neural diarization (EEND). Nevertheless, training an EEND system generally requires lengthy audio inputs, which must be downsampled to allow efficient model processing. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as the foundational convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder structure to aggregate the features extracted by each conformer block and pass them to the output layer, thereby enhancing the expressiveness of the model's feature extraction. Lastly, we employ the conformer as the backbone network to incorporate the proposed enhancements, resulting in an EEND system named BSAC-EEND. We evaluate the proposed method on both simulated and real datasets. The experiments indicate that the proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% for two-speaker datasets and 12.8% for three-speaker datasets compared to the baseline.
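The difference between the two convolutional units named in the abstract can be illustrated with a small sketch. It is not the paper's downsampling layer: it assumes a 1-D signal, stride 1, no bias or padding, and the unconstrained BSConv variant (a pointwise convolution followed by a depthwise one), whereas DSC applies the depthwise convolution first. The function names and toy values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only (assumptions: 1-D input, stride 1, no bias,
# no padding). Tensors are plain nested lists shaped [channels][time].

def pointwise(x, w):
    """1x1 convolution: mixes channels at each time step.
    x: [C_in][T], w: [C_out][C_in] -> [C_out][T]"""
    steps = len(x[0])
    return [[sum(row[c] * x[c][t] for c in range(len(x))) for t in range(steps)]
            for row in w]

def depthwise(x, k):
    """Depthwise convolution: one kernel per channel, no channel mixing.
    x: [C][T], k: [C][K] -> [C][T - K + 1]"""
    size = len(k[0])
    steps = len(x[0]) - size + 1
    return [[sum(k[c][j] * x[c][t + j] for j in range(size)) for t in range(steps)]
            for c in range(len(x))]

def dsc(x, k_dw, w_pw):
    """Depthwise separable convolution: depthwise first, then pointwise."""
    return pointwise(depthwise(x, k_dw), w_pw)

def bsconv_u(x, w_pw, k_dw):
    """Unconstrained blueprint separable convolution: pointwise first,
    then depthwise on the C_out channels."""
    return depthwise(pointwise(x, w_pw), k_dw)

def param_counts(c_in, c_out, kernel):
    """Weight counts for the two orderings (1-D kernels, no bias)."""
    return {
        "dsc": c_in * kernel + c_in * c_out,
        "bsconv_u": c_in * c_out + c_out * kernel,
    }

# Toy input: 2 channels, 4 time steps (hypothetical values).
x = [[1, 2, 3, 4], [0, 1, 0, 1]]
w_pw = [[1, 1], [1, -1], [2, 0]]         # 2 -> 3 channels
k_dw_in = [[1, 1], [1, 0]]               # DSC: one kernel per input channel
k_dw_out = [[1, 1], [1, 0], [0, 1]]      # BSConv: one kernel per output channel

y_dsc = dsc(x, k_dw_in, w_pw)
y_bs = bsconv_u(x, w_pw, k_dw_out)
counts = param_counts(2, 3, 2)
```

In this sketch both units produce an output of the same shape; the only structural difference is whether the depthwise kernels act on the input channels (DSC) or on the channels produced by the pointwise mixing (BSConv).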
Pages: 14
Related papers
50 in total
  • [21] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech
    Kinoshita, Keisuke
    Delcroix, Marc
    Tawara, Naohiro
    INTERSPEECH 2021, 2021, : 3565 - 3569
  • [22] On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
    Huang, Yiling
    Wang, Weiran
    Zhao, Guanlong
    Liao, Hank
    Xia, Wei
    Wang, Quan
    INTERSPEECH 2024, 2024, : 32 - 36
  • [23] MUTUAL LEARNING OF SINGLE- AND MULTI-CHANNEL END-TO-END NEURAL DIARIZATION
    Horiguchi, Shota
    Takashima, Yuki
    Watanabe, Shinji
    Garcia, Paola
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 620 - 625
  • [24] Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization
    Fujita, Yusuke
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    IEEE ACCESS, 2023, 11 : 140069 - 140076
  • [25] From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization
    Landini, Federico
    Lozano-Diez, Alicia
    Diez, Mireia
    Burget, Lukas
    INTERSPEECH 2022, 2022, : 5095 - 5099
  • [26] Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization
    Takashima, Yuki
    Fujita, Yusuke
    Horiguchi, Shota
    Watanabe, Shinji
    Garcia, Paola
    Nagamatsu, Kenji
    INTERSPEECH 2021, 2021, : 3096 - 3100
  • [27] End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Dehak, Najim
    Kowalczyk, Konrad
    INTERSPEECH 2022, 2022, : 5090 - 5094
  • [28] End-to-end Cooperative Localization via Neural Feature Sharing
    Gao, Letian
    Xiang, Hao
    Xia, Xin
    Ma, Jiaqi
    2024 35TH IEEE INTELLIGENT VEHICLES SYMPOSIUM, IEEE IV 2024, 2024, : 553 - 558
  • [29] Key Frame Mechanism for Efficient Conformer Based End-to-End Speech Recognition
    Fan, Peng
    Shan, Changhao
    Sun, Sining
    Yang, Qing
    Zhang, Jianwei
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 1612 - 1616
  • [30] OVERLAP-AWARE DIARIZATION: RESEGMENTATION USING NEURAL END-TO-END OVERLAPPED SPEECH DETECTION
    Bullock, Latane
    Bredin, Herve
    Garcia-Perera, Leibny Paola
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7114 - 7118