Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Cited by: 1

Authors:

Jiao, Xiaolin [1]

Chen, Yaqi [2]

Qu, Dan [2]

Yang, Xukui [2]

Affiliations:

[1] Zhengzhou Univ, Sch Cyber Sci & Engn, Zhengzhou 450001, Peoples R China

[2] Informat Engn Univ, Sch Informat Syst Engn, Zhengzhou 450001, Peoples R China

Source:

ELECTRONICS | 2023 / Vol. 12 / Issue 19

Funding:

National Natural Science Foundation of China;

Keywords:

end-to-end neural diarization (EEND); blueprint separable convolution (BSConv); multi-scale feature aggregation (MFA); SPEAKER DIARIZATION; SEPARATION;

DOI:

10.3390/electronics12194118

Chinese Library Classification (CLC):

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

At present, a prevalent approach to speaker diarization is clustering based on speaker embeddings. However, this method encounters two primary issues. Firstly, it cannot directly minimize the diarization error during training; secondly, the majority of clustering-based methods struggle to handle speaker overlap in audio. A viable approach for addressing these issues is end-to-end neural diarization (EEND). Nevertheless, training an EEND system generally requires lengthy audio inputs, which must be downsampled to allow efficient model processing. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as the foundational convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder structure to pass the features extracted by each conformer block to the output layer, thereby enhancing the expressiveness of the model's feature extraction. Lastly, we employ the conformer as the backbone network to incorporate the proposed enhancements, resulting in an EEND system named BSAC-EEND. We evaluate the proposed method on both simulated and real datasets. The experiments indicate that the proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% on two-speaker datasets and 12.8% on three-speaker datasets compared to the baseline.
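The difference between the two convolutional units named in the abstract can be illustrated with a small sketch. The abstract does not say which BSConv variant or kernel configuration the downsampling layer uses, so the code below assumes the unconstrained variant (a pointwise convolution followed by a depthwise convolution), simplifies everything to 1-D, and omits biases, padding, and normalization. All function names are hypothetical, and this is not the authors' layer.

```python
# Illustrative sketch only: 1-D, no bias, no padding. Function names are hypothetical.

def pointwise(x, w):
    """1x1 convolution. x is [C_in][T]; w is [C_out][C_in]."""
    frames = len(x[0])
    return [[sum(w_o[c] * x[c][t] for c in range(len(x))) for t in range(frames)]
            for w_o in w]

def depthwise(x, k, stride=1):
    """Per-channel convolution. x is [C][T]; k is [C][K]. A stride > 1 subsamples in time."""
    width = len(k[0])
    frames_out = (len(x[0]) - width) // stride + 1
    return [[sum(k[c][j] * x[c][t * stride + j] for j in range(width))
             for t in range(frames_out)]
            for c in range(len(x))]

def bsconv_u(x, w_pw, k_dw, stride=1):
    """Assumed BSConv ordering: pointwise first, then depthwise."""
    return depthwise(pointwise(x, w_pw), k_dw, stride)

def dsc(x, k_dw, w_pw, stride=1):
    """Depthwise separable ordering: depthwise first, then pointwise."""
    return pointwise(depthwise(x, k_dw, stride), w_pw)

def param_counts(c_in, c_out, k):
    """Weight counts (biases ignored) for a 1-D kernel of width k."""
    return {
        "standard": c_in * c_out * k,
        "dsc": c_in * k + c_in * c_out,
        "bsconv_u": c_in * c_out + c_out * k,
    }

# Toy input: 2 channels, 5 frames, subsampled with stride 2.
x = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0]]
w_pw = [[1, 0], [0, 1], [1, 1]]      # maps 2 channels to 3
k_dw = [[1, 1, 1]] * 3               # one width-3 kernel per output channel
y = bsconv_u(x, w_pw, k_dw, stride=2)
print(len(y), len(y[0]))             # channels and frames after subsampling
print(param_counts(2, 3, 3))
```

In this sketch the depthwise kernels of the BSConv ordering act on the output channels rather than the input channels, which is the structural difference from DSC; whether that accounts for the information preservation reported in the abstract is a claim of the paper, not something this toy example demonstrates.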

Pages: 14
