Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

Citations: 0
Authors
Fujita, Yusuke [1, 2]
Ogawa, Tetsuji [2 ]
Kobayashi, Tetsunori [2 ]
Affiliations
[1] LY Corp, Tokyo 1028282, Japan
[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan
Keywords
Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization
DOI
10.1109/ACCESS.2023.3340307
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the end-to-end neural diarization (EEND) family, a promising approach that treats speaker diarization as a multi-label classification problem solved by a neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With the self-conditioning mechanism, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To efficiently train the attractor-based EEND model, we propose an improved attractor computation module named the non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments with the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
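The self-conditioning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give layer definitions, so the code below assumes a toy `tanh` layer in place of a Transformer encoder layer, a shared output projection `w_out` for every intermediate prediction, and a feedback projection `w_back` that adds the intermediate speaker posteriors to the input of the next layer. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only (no training happens here)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def toy_layer(hidden, weight):
    """Assumed stand-in for an encoder layer: a linear map followed by tanh."""
    return [[math.tanh(v) for v in row] for row in matmul(hidden, weight)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def self_conditioned_forward(frames, layer_weights, w_out, w_back):
    """Return per-layer speaker posteriors, each of shape (frames x speakers).

    After every layer except the last, the whole sequence of intermediate
    posteriors is projected back to the hidden size and added to the hidden
    states, so later layers can refine the labels with reference to them.
    """
    hidden = frames
    intermediate = []
    last = len(layer_weights) - 1
    for i, weight in enumerate(layer_weights):
        hidden = toy_layer(hidden, weight)
        # Multi-label output: one independent sigmoid per speaker per frame.
        posteriors = [[sigmoid(v) for v in row] for row in matmul(hidden, w_out)]
        intermediate.append(posteriors)
        if i < last:
            feedback = matmul(posteriors, w_back)
            hidden = [
                [h + f for h, f in zip(h_row, f_row)]
                for h_row, f_row in zip(hidden, feedback)
            ]
    return intermediate


rng = random.Random(0)
num_frames, hidden_size, num_speakers, num_layers = 6, 8, 2, 3
frames = random_matrix(num_frames, hidden_size, rng)
layer_weights = [random_matrix(hidden_size, hidden_size, rng) for _ in range(num_layers)]
w_out = random_matrix(hidden_size, num_speakers, rng)
w_back = random_matrix(num_speakers, hidden_size, rng)

predictions = self_conditioned_forward(frames, layer_weights, w_out, w_back)
print(len(predictions), len(predictions[0]), len(predictions[0][0]))  # prints: 3 6 2
```

In a trained model, each entry of `predictions` would presumably receive its own diarization loss (the "intermediate objectives" named in the keywords), and the final entry would serve as the output; the sketch only shows the data flow.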
Pages: 140069-140076
Page count: 8