Self-Conditioning via Intermediate Predictions for End-to-End Neural Speaker Diarization

Cited by: 0

Authors:

Fujita, Yusuke [1, 2]

Ogawa, Tetsuji [2]

Kobayashi, Tetsunori [2]

Affiliations:

[1] LY Corp, Tokyo 1028282, Japan

[2] Waseda Univ, Dept Comp Sci & Commun Engn, Tokyo 1620042, Japan

Source:

IEEE ACCESS | 2023 / Vol. 11

Keywords:

Encoder-decoder-based attractors; end-to-end neural diarization; intermediate objectives; non-autoregressive models; self-conditioning; speaker diarization

DOI:

10.1109/ACCESS.2023.3340307

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper presents a speaker diarization model that incorporates label dependency via intermediate predictions. The proposed method belongs to the end-to-end neural diarization (EEND) family, a promising approach that solves the speaker diarization problem with a multi-label classification neural network. While most EEND-based models assume conditional independence between frame-level speaker labels, the proposed method introduces label dependency into the models by exploiting the self-conditioning mechanism, which was originally applied to an automatic speech recognition model. With the self-conditioning mechanism, speaker labels are iteratively refined by taking the whole sequence of intermediate speaker labels as a reference. We demonstrate the effectiveness of self-conditioning in both Transformer-based and attractor-based EEND models. To train the attractor-based EEND model efficiently, we propose an improved attractor computation module, named the non-autoregressive attractor, which produces speaker-wise attractors simultaneously in a non-autoregressive manner. Experiments on the CALLHOME two-speaker dataset show that the proposed self-conditioning improves diarization performance and progressively reduces errors through successive intermediate predictions. In addition, the proposed non-autoregressive attractor improves training efficiency and provides a synergistic boost with self-conditioning, leading to superior performance compared with existing diarization models.
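
The self-conditioning idea summarized in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the layer `toy_layer` (a stand-in for a Transformer block that mixes each frame with the sequence mean), the shared output projection `W_out`, the feedback projection `W_back`, and all sizes are assumptions chosen only for illustration. The sketch shows one plausible reading of the mechanism: after each layer, frame-level speaker activities are predicted, projected back to the hidden dimension, and added to the hidden states, so that later layers can refer to the whole sequence of intermediate labels.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def toy_layer(H, W):
    # Hypothetical stand-in for an encoder block: each frame is mixed with
    # the sequence mean (a crude proxy for self-attention), followed by a
    # residual tanh projection.
    T, D = len(H), len(H[0])
    mean = [sum(h[d] for h in H) / T for d in range(D)]
    out = []
    for h in H:
        z = matvec(W, [a + m for a, m in zip(h, mean)])
        out.append([a + math.tanh(b) for a, b in zip(h, z)])
    return out


def self_conditioned_forward(X, layer_weights, W_out, W_back):
    # Returns one (frames x speakers) prediction per layer.
    H = X
    predictions = []
    for i, W in enumerate(layer_weights):
        H = toy_layer(H, W)
        # Multi-label prediction: an independent sigmoid per speaker and frame.
        Y = [[sigmoid(v) for v in matvec(W_out, h)] for h in H]
        predictions.append(Y)
        if i < len(layer_weights) - 1:
            # Assumed conditioning step: feed the intermediate labels back
            # into the hidden states before the next layer.
            H = [[a + b for a, b in zip(h, matvec(W_back, y))]
                 for h, y in zip(H, Y)]
    return predictions


# Toy usage with random inputs: 6 frames, 4 hidden dims, 2 speakers, 3 layers.
rng = random.Random(0)
T, D, S, L = 6, 4, 2, 3
X = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
layer_weights = [rand_matrix(D, D, rng) for _ in range(L)]
W_out = rand_matrix(S, D, rng)   # hidden -> speaker activities
W_back = rand_matrix(D, S, rng)  # speaker activities -> hidden
preds = self_conditioned_forward(X, layer_weights, W_out, W_back)
```

In a trained model, each entry of `preds` would typically receive its own loss term so that the intermediate predictions are meaningful; the weights here are random, so the sketch only demonstrates the data flow.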


Pages: 140069-140076

Page count: 8
