END-TO-END DIARIZATION FOR VARIABLE NUMBER OF SPEAKERS WITH LOCAL-GLOBAL NETWORKS AND DISCRIMINATIVE SPEAKER EMBEDDINGS

Cited by: 13

Authors:

Maiti, Soumi [1,4]

Erdogan, Hakan [2]

Wilson, Kevin [2]

Wisdom, Scott [2]

Watanabe, Shinji [3]

Hershey, John R. [2]

Affiliations:

[1] CUNY, Grad Ctr, New York, NY 10010 USA

[2] Google Res, Mountain View, CA USA

[3] Johns Hopkins Univ, Baltimore, MD 21218 USA

[4] Google, Mountain View, CA 94043 USA

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Keywords:

Diarization; attention; deep learning;

DOI:

10.1109/ICASSP39728.2021.9414841

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
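The permutation-invariant cross-entropy idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a fixed, known number of speakers and per-frame speaker-activity probabilities, so it is not the variable-number loss used in the paper; the function names `binary_cross_entropy` and `pit_bce` and the toy data are hypothetical and chosen only for illustration.

```python
import itertools
import math


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between two equal-length sequences."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def pit_bce(pred, ref):
    """Permutation-invariant BCE for a fixed number of speakers (assumption).

    pred, ref: lists of per-speaker activity tracks, each a list over frames.
    Returns (best_loss, best_perm), where best_perm[i] is the index of the
    reference speaker assigned to predicted output i.
    """
    n = len(pred)
    best_loss, best_perm = float("inf"), None
    # Exhaustive search over speaker orderings; only practical for small n.
    for perm in itertools.permutations(range(n)):
        loss = sum(
            binary_cross_entropy(pred[i], ref[perm[i]]) for i in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data: two speakers, four frames, both active in the second frame
# (overlap). The predicted tracks come out in the opposite order to the
# reference tracks.
ref = [[1, 1, 0, 0], [0, 1, 1, 0]]
pred = [[0.1, 0.9, 0.8, 0.1], [0.9, 0.8, 0.2, 0.1]]
loss, perm = pit_bce(pred, ref)
```

Because the loss is taken as the minimum over speaker orderings, the model is not penalized for emitting speakers in a different order than the reference, and each frame may mark more than one speaker as active, which is how overlap is represented in this sketch.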

Pages: 7183-7187

Page count: 5

50 items in total

[31] Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor
Chen, Zhengyang
Han, Bing
Wang, Shuai
Qian, Yanmin
INTERSPEECH 2023, 2023: 3552-3556
[32] Improved Relation Networks for End-to-End Speaker Verification and Identification
Chaubey, Ashutosh
Sinha, Sparsh
Ghose, Susmita
INTERSPEECH 2022, 2022: 5085-5089
[33] Investigation of Training Mute-Expressive End-to-End Speech Separation Networks for an Unknown Number of Speakers
Kim, Younggwan
Lim, Hyungjun
Yeom, Kiho
Seo, Eunjoo
Lee, Hoodong
Choi, Stanley Jungkyu
Lee, Honglak
INTERSPEECH 2023, 2023: 3764-3768
[34] On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
Huang, Yiling
Wang, Weiran
Zhao, Guanlong
Liao, Hank
Xia, Wei
Wang, Quan
INTERSPEECH 2024, 2024: 32-36
[35] FRAME-LEVEL SPEAKER EMBEDDINGS FOR TEXT-INDEPENDENT SPEAKER RECOGNITION AND ANALYSIS OF END-TO-END MODEL
Shon, Suwon
Tang, Hao
Glass, James
2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018: 1007-1013
[36] GENERATIVE ADVERSARIAL SPEAKER EMBEDDING NETWORKS FOR DOMAIN ROBUST END-TO-END SPEAKER VERIFICATION
Bhattacharya, Gautam
Monteiro, Joao
Alam, Jahangir
Kenny, Patrick
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6226-6230
[37] Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition
Miguel, Antonio
Llombart, Jorge
Ortega, Alfonso
Lleida, Eduardo
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 2819-2823
[38] Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings
Yang, Chenyu
Chen, Mengxi
Wang, Yanfeng
Wang, Yu
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 4031-4041
[39] SPEAKER-AWARE TRAINING OF ATTENTION-BASED END-TO-END SPEECH RECOGNITION USING NEURAL SPEAKER EMBEDDINGS
Rouhe, Aku
Kaseva, Tuomas
Kurimo, Mikko
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020: 7064-7068
[40] End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
Rybicka, Magdalena
Villalba, Jesus
Dehak, Najim
Kowalczyk, Konrad
INTERSPEECH 2022, 2022: 5090-5094
