Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Cited by: 1
Authors
Jiao, Xiaolin [1]
Chen, Yaqi [2]
Qu, Dan [2]
Yang, Xukui [2]
Affiliations
[1] Zhengzhou Univ, Sch Cyber Sci & Engn, Zhengzhou 450001, Peoples R China
[2] Informat Engn Univ, Sch Informat Syst Engn, Zhengzhou 450001, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
end-to-end neural diarization (EEND); blueprint separable convolution (BSConv); multi-scale feature aggregation (MFA); speaker diarization; separation
DOI
10.3390/electronics12194118
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
At present, a prevalent approach to speaker diarization is to cluster speaker embeddings. However, this approach has two main limitations: it cannot directly minimize the diarization error during training, and most clustering-based methods struggle to handle overlapping speech. End-to-end neural diarization (EEND) is a viable way to address both issues. Nevertheless, training an EEND system generally requires long audio inputs, which must be downsampled so that the model can process them efficiently. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as its basic convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder to combine the features extracted by each conformer block and feed them to the output layer, thereby enhancing the expressiveness of the model's feature extraction. Finally, we use the conformer as the backbone network for these enhancements, resulting in an EEND system named BSAC-EEND. We evaluate the proposed method on both simulated and real datasets. The experiments indicate that the proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% on two-speaker datasets and 12.8% on three-speaker datasets compared with the baseline.
Pages: 14
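The abstract contrasts BSConv with DSC as the convolutional unit of the downsampling layer but gives no layer sizes. The following is a minimal sketch of how the two factorizations differ in weight count. It assumes the unconstrained BSConv variant (a pointwise 1x1 convolution followed by a depthwise convolution), square 2-D kernels, and no bias terms. The function names and the channel and kernel sizes are hypothetical and are not taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def dsc_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise separable convolution: a depthwise k x k filter per
    input channel, followed by a pointwise 1x1 convolution."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


def bsconv_u_params(c_in: int, c_out: int, k: int) -> int:
    """Assumed BSConv-U ordering: a pointwise 1x1 convolution first,
    followed by a depthwise k x k filter per output channel."""
    pointwise = c_in * c_out
    depthwise = c_out * k * k
    return pointwise + depthwise


if __name__ == "__main__":
    # Illustrative sizes only (hypothetical, not from the paper).
    c_in, c_out, k = 64, 256, 3
    print("standard:", standard_conv_params(c_in, c_out, k))
    print("DSC:     ", dsc_params(c_in, c_out, k))
    print("BSConv-U:", bsconv_u_params(c_in, c_out, k))
```

Under these assumptions both factorizations use far fewer weights than a standard convolution. They differ in where the spatial filtering happens: DSC filters the input channels before mixing them, while BSConv-U mixes channels first and then filters the wider output representation.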