Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

被引:0
|
作者
Fujita, Yusuke [1 ,2 ]
Ogawa, Tetsuji [2 ]
Kobayashi, Tetsunori [2 ]
机构
[1] LY Corp, Tokyo 1028282, Japan
[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan
关键词
Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization;
D O I
10.1109/ACCESS.2023.3340307
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method is categorized as an end-to-end neural diarization (EEND), which has been a promising method for solving the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces the label dependency to the models by exploiting the self-conditioning mechanism, which has been originally applied to an automatic speech recognition model. With the self-conditioning mechanism, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To efficiently train the attractor-based EEND model, we propose an improved attractor computation module named non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. The experiments with the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts the diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergetic boost with self-conditioning, leading to superior performance compared with existing diarization models.
引用
收藏
页码:140069 / 140076
页数:8
相关论文
共 50 条
  • [21] END-TO-END SPEAKER DIARIZATION CONDITIONED ON SPEECH ACTIVITY AND OVERLAP DETECTION
    Takashima, Yuki
    Fujita, Yusuke
    Watanabe, Shinji
    Horiguchi, Shota
    Garcia, Paola
    Nagamatsu, Kenji
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 849 - 856
  • [22] BW-EDA-EEND: STREAMING END-TO-END NEURAL SPEAKER DIARIZATION FOR A VARIABLE NUMBER OF SPEAKERS
    Han, Eunjung
    Lee, Chul
    Stolcke, Andreas
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7193 - 7197
  • [23] On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
    Huang, Yiling
    Wang, Weiran
    Zhao, Guanlong
    Liao, Hank
    Xia, Wei
    Wang, Quan
    INTERSPEECH 2024, 2024, : 32 - 36
  • [24] TRANSCRIBE-TO-DIARIZE: NEURAL SPEAKER DIARIZATION FOR UNLIMITED NUMBER OF SPEAKERS USING END-TO-END SPEAKER-ATTRIBUTED ASR
    Kanda, Naoyuki
    Xiao, Xiong
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Chen, Zhuo
    Yoshioka, Takuya
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8082 - 8086
  • [25] EEND-SS: JOINT END-TO-END NEURAL SPEAKER DIARIZATION AND SPEECH SEPARATION FOR FLEXIBLE NUMBER OF SPEAKERS
    Maiti, Soumi
    Ueda, Yushi
    Watanabe, Shinji
    Zhang, Chunlei
    Yu, Meng
    Zhang, Shi-Xiong
    Xu, Yong
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 480 - 487
  • [26] End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Dehak, Najim
    Kowalczyk, Konrad
    INTERSPEECH 2022, 2022, : 5090 - 5094
  • [27] A study on end-to-end speaker diarization system using single-label classification
    Jung, Jaehee
    Kim, Wooil
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2023, 42 (06): : 536 - 543
  • [28] Encoder-Decoder Based Attractors for End-to-End Neural Diarization
    Horiguchi, Shota
    Fujita, Yusuke
    Watanabe, Shinji
    Xue, Yawen
    Garcia, Paola
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1493 - 1507
  • [29] Neural PLDA Modeling for End-to-End Speaker Verification
    Ramoji, Shreyas
    Krishnan, Prashant
    Ganapathy, Sriram
    INTERSPEECH 2020, 2020, : 4333 - 4337
  • [30] Improving End-to-End Neural Diarization Using Conversational Summary Representations
    Broughton, Samuel J.
    Samarakoon, Lahiru
    INTERSPEECH 2023, 2023, : 3157 - 3161