Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Cited by: 1
Authors
Jiao, Xiaolin [1 ]
Chen, Yaqi [2 ]
Qu, Dan [2 ]
Yang, Xukui [2 ]
Affiliations
[1] Zhengzhou Univ, Sch Cyber Sci & Engn, Zhengzhou 450001, Peoples R China
[2] Informat Engn Univ, Sch Informat Syst Engn, Zhengzhou 450001, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
end-to-end neural diarization (EEND); blueprint separable convolution (BSConv); multi-scale feature aggregation (MFA); speaker diarization; separation
DOI
10.3390/electronics12194118
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A prevalent approach to speaker diarization is clustering based on speaker embeddings. This approach has two primary limitations: it cannot directly minimize the diarization error during training, and most clustering-based methods struggle to handle overlapping speech. End-to-end neural diarization (EEND) addresses both issues. However, training an EEND system generally requires long audio inputs, which must be downsampled so that the model can process them efficiently. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as its basic convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder so that the features extracted by each conformer block are combined and passed to the output layer, enhancing the expressiveness of the model's feature extraction. Lastly, we use the conformer as the backbone network for these enhancements, resulting in an EEND system named BSAC-EEND. We evaluate the proposed method on both simulated and real datasets. The experiments indicate that the proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% on two-speaker datasets and 12.8% on three-speaker datasets compared to the baseline.
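To illustrate the downsampling idea in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a 1-D input of shape [channels][frames], no padding, and the common BSConv ordering in which a pointwise (1x1) convolution comes first and a strided depthwise convolution follows; DSC applies the two steps in the opposite order. All function names, weights, and sizes here are hypothetical and chosen only for illustration.

```python
# Minimal pure-Python sketch of a BSConv-style subsampling step.
# Assumptions (not from the paper): 1-D input [channels][frames],
# no padding, no bias, hand-picked weights.

def pointwise(x, w):
    """1x1 convolution: mixes channels at each frame.
    x: [C_in][T], w: [C_out][C_in] -> [C_out][T]."""
    frames = len(x[0])
    return [
        [sum(w[o][i] * x[i][t] for i in range(len(x))) for t in range(frames)]
        for o in range(len(w))
    ]

def depthwise(x, k, stride):
    """Per-channel convolution with a stride (this is what shortens the sequence).
    x: [C][T], k: [C][K] -> [C][(T - K) // stride + 1]."""
    width = len(k[0])
    out_len = (len(x[0]) - width) // stride + 1
    return [
        [sum(k[c][j] * x[c][t * stride + j] for j in range(width))
         for t in range(out_len)]
        for c in range(len(x))
    ]

def bsconv_subsample(x, w_pw, k_dw, stride):
    """BSConv-style order: pointwise first, then strided depthwise."""
    return depthwise(pointwise(x, w_pw), k_dw, stride)

def dsc_subsample(x, k_dw, w_pw, stride):
    """DSC order, for comparison: strided depthwise first, then pointwise."""
    return pointwise(depthwise(x, k_dw, stride), w_pw)

# Toy input: 2 channels, 10 frames.
x = [[1.0] * 10, [2.0] * 10]
w_pw = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 -> 3 channels
k_dw = [[1.0, 1.0, 1.0] for _ in range(3)]    # kernel width 3 per channel
y = bsconv_subsample(x, w_pw, k_dw, stride=2)
print(len(y), len(y[0]))  # prints "3 4"
```

In this toy configuration the 10 input frames are reduced to 4. The separable layer uses 2*3 + 3*3 = 15 weights, whereas a standard convolution with the same channel counts and kernel width would use 2*3*3 = 18.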
Pages: 14
Related papers
50 in total
  • [31] Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
    Xue, Yawen
    Horiguchi, Shota
    Fujita, Yusuke
    Takashima, Yuki
    Watanabe, Shinji
    Garcia, Paola
    Nagamatsu, Kenji
    INTERSPEECH 2021, 2021, : 3116 - 3120
  • [32] Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor
    Chen, Zhengyang
    Han, Bing
    Wang, Shuai
    Qian, Yanmin
    INTERSPEECH 2023, 2023, : 3552 - 3556
  • [33] BW-EDA-EEND: STREAMING END-TO-END NEURAL SPEAKER DIARIZATION FOR A VARIABLE NUMBER OF SPEAKERS
    Han, Eunjung
    Lee, Chul
    Stolcke, Andreas
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7193 - 7197
  • [34] Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization
    Jeoung, Ye-Rin
    Choi, Jeong-Hwan
    Seong, Ju-Seok
    Kyung, JeHyun
    Chang, Joon-Hyuk
    INTERSPEECH 2023, 2023, : 3197 - 3201
  • [35] An efficient end-to-end feature based system for SAR ATR
    Pham, QH
    Brosnan, TM
    Smith, MJT
    Mersereau, RM
    ALGORITHMS FOR SYNTHETIC APERTURE RADAR IMAGERY V, 1998, 3370 : 519 - 529
  • [36] End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Xue, Yawen
    Nagamatsu, Kenji
    INTERSPEECH 2020, 2020, : 269 - 273
  • [37] Auxiliary feature based adaptation of end-to-end ASR systems
    Delcroix, Marc
    Watanabe, Shinji
    Ogawa, Atsunori
    Karita, Shigeki
    Nakatani, Tomohiro
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2444 - 2448
  • [38] An End-to-End Rumor Detection Model Based on Feature Aggregation
    Ye, Aoshuang
    Wang, Lina
    Wang, Run
    Wang, Wenqi
    Ke, Jianpeng
    Wang, Danlei
    COMPLEXITY, 2021, 2021
  • [39] FRNet: an end-to-end feature refinement neural network for medical image segmentation
    Wang, Dan
    Hu, Guoqing
    Lyu, Chengzhi
    VISUAL COMPUTER, 2021, 37 (05): : 1101 - 1112
  • [40] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINTLY TRAINED NEURAL FEATURE ENHANCEMENT
    Kim, Chanwoo
    Garg, Abhinav
    Gowda, Dhananjaya
    Mun, Seongkyu
    Han, Changwoo
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6773 - 6777