Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Cited by: 1

Authors:

Jiao, Xiaolin [1]

Chen, Yaqi [2]

Qu, Dan [2]

Yang, Xukui [2]

Affiliations:

[1] Zhengzhou Univ, Sch Cyber Sci & Engn, Zhengzhou 450001, Peoples R China

[2] Informat Engn Univ, Sch Informat Syst Engn, Zhengzhou 450001, Peoples R China

Source:

ELECTRONICS | 2023 / Vol. 12 / Issue 19

Funding:

National Natural Science Foundation of China;

Keywords:

end-to-end neural diarization (EEND); blueprint separable convolution (BSConv); multi-scale feature aggregation (MFA); SPEAKER DIARIZATION; SEPARATION;

DOI:

10.3390/electronics12194118

Chinese Library Classification (CLC):

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

At present, a prevalent approach to speaker diarization is clustering based on speaker embeddings. However, this method encounters two primary issues. Firstly, it cannot directly minimize the diarization error during training; secondly, most clustering-based methods struggle to handle speaker overlap in audio. A viable approach for addressing these issues is end-to-end neural diarization (EEND). Nevertheless, training an EEND system generally requires lengthy audio inputs, which must be downsampled to allow efficient model processing. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as the foundational convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder structure to combine the features extracted by each conformer block and pass them to the output layer, thereby enhancing the expressiveness of the model's feature extraction. Lastly, we employ the conformer as the backbone network to incorporate the proposed enhancements, resulting in an EEND system named BSAC-EEND. We assess the proposed methodology on both simulated and real datasets. The experiments indicate that the proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% for two-speaker datasets and 12.8% for three-speaker datasets compared to the baseline.
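The difference between the two convolutional units named in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which DSC applies a depthwise convolution followed by a pointwise (1x1) convolution, while the unconstrained BSConv variant reverses that order. The 1-D time-only layout, the helper names (`pointwise`, `depthwise`, `dsc`, `bsconv`), and the toy sizes are illustrative assumptions, not the architecture of BSAC-EEND itself.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: [channels][frames]


def pointwise(x: Matrix, w: Matrix) -> Matrix:
    """1x1 convolution: mixes channels at every frame. w has shape [out][in]."""
    frames = len(x[0])
    return [
        [sum(w_o[c] * x[c][t] for c in range(len(x))) for t in range(frames)]
        for w_o in w
    ]


def depthwise(x: Matrix, k: Matrix, stride: int = 1) -> Matrix:
    """Per-channel 1-D convolution along time, no padding. k has shape [channels][kernel]."""
    out = []
    for ch, kern in zip(x, k):
        n = (len(ch) - len(kern)) // stride + 1  # number of output frames
        out.append(
            [sum(kern[j] * ch[t * stride + j] for j in range(len(kern))) for t in range(n)]
        )
    return out


def dsc(x: Matrix, k_dw: Matrix, w_pw: Matrix, stride: int) -> Matrix:
    """Depthwise separable convolution: depthwise first, then pointwise."""
    return pointwise(depthwise(x, k_dw, stride), w_pw)


def bsconv(x: Matrix, w_pw: Matrix, k_dw: Matrix, stride: int) -> Matrix:
    """Blueprint separable convolution (unconstrained form): pointwise first, then depthwise."""
    return depthwise(pointwise(x, w_pw), k_dw, stride)


# Toy input: 2 channels, 8 frames; 3 output channels, kernel size 3, stride 2.
x = [[1.0] * 8 for _ in range(2)]
w_pw = [[1.0, 1.0] for _ in range(3)]          # [out=3][in=2]
y_dsc = dsc(x, [[1.0] * 3 for _ in range(2)], w_pw, stride=2)   # depthwise over 2 input channels
y_bs = bsconv(x, w_pw, [[1.0] * 3 for _ in range(3)], stride=2)  # depthwise over 3 output channels
```

Both variants subsample the 8 input frames to 3 output frames here. The structural difference is that the depthwise kernels in BSConv act on the already channel-mixed output, so there is one temporal kernel per output channel rather than per input channel.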

Pages: 14
