Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

Cited by: 0
Authors
Fujita, Yusuke [1 ,2 ]
Ogawa, Tetsuji [2 ]
Kobayashi, Tetsunori [2 ]
Affiliations
[1] LY Corp, Tokyo 1028282, Japan
[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan
Keywords
Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization;
DOI
10.1109/ACCESS.2023.3340307
Chinese Library Classification (CLC)
TP [Automation &amp; Computer Technology]
Discipline classification code
0812
Abstract
This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method falls under end-to-end neural diarization (EEND), a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency by exploiting the self-conditioning mechanism, which was originally applied to automatic speech recognition models. With self-conditioning, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module, named the non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning boosts diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
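The abstract describes the self-conditioning data flow: each encoder layer emits intermediate frame-wise speaker posteriors, and those posteriors are projected back into the hidden states that feed the next layer, so later predictions can refine earlier ones. A minimal NumPy sketch of that loop is given below; it is only an illustration of the mechanism inferred from the abstract, not the paper's model. The random linear maps standing in for Transformer layers, and all names (`W_out`, `W_cond`, `encode`), are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, S, L = 10, 16, 2, 4  # frames, hidden dim, speakers, encoder layers

# Hypothetical toy weights: the actual model uses Transformer blocks; here
# each "layer" is a random linear map just to show where conditioning enters.
layer_weights = [rng.normal(scale=0.1, size=(D, D)) for _ in range(L)]
W_out = rng.normal(scale=0.1, size=(D, S))   # hidden states -> speaker logits
W_cond = rng.normal(scale=0.1, size=(S, D))  # posteriors -> hidden (self-conditioning)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(x, self_conditioning=True):
    """Run the toy encoder, collecting intermediate speaker posteriors."""
    h = x
    intermediates = []
    for W in layer_weights:
        h = np.tanh(h @ W)              # stand-in for one Transformer layer
        z = sigmoid(h @ W_out)          # intermediate frame-wise posteriors
        intermediates.append(z)
        if self_conditioning:
            h = h + z @ W_cond          # feed predictions back into next layer
    return intermediates

x = rng.normal(size=(T, D))
preds = encode(x)
print(len(preds), preds[-1].shape)  # 4 (10, 2)
```

In training, the paper's intermediate-objective idea corresponds to attaching the diarization loss to every element of `intermediates`, not only the last, so each layer is pushed to produce usable speaker labels for the next layer to condition on.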
Pages: 140069-140076
Number of pages: 8