END-TO-END DIARIZATION FOR VARIABLE NUMBER OF SPEAKERS WITH LOCAL-GLOBAL NETWORKS AND DISCRIMINATIVE SPEAKER EMBEDDINGS

Cited by: 13
Authors
Maiti, Soumi [1, 4]
Erdogan, Hakan [2]
Wilson, Kevin [2]
Wisdom, Scott [2]
Watanabe, Shinji [3]
Hershey, John R. [2]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10010 USA
[2] Google Res, Mountain View, CA USA
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
[4] Google, Mountain View, CA 94043 USA
Keywords
Diarization; attention; deep learning
DOI
10.1109/ICASSP39728.2021.9414841
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings. End-to-end diarization models have the advantage of handling speaker overlap and enabling straightforward handling of discriminative training, unlike traditional clustering-based diarization methods. The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions. We introduce several components that appear to help with diarization performance, including a local convolutional network followed by a global self-attention module, multi-task transfer learning using a speaker identification component, and a sequential approach where the model is refined with a second stage. These are trained and validated on simulated meeting data based on LibriSpeech and LibriTTS datasets; final evaluations are done using LibriCSS, which consists of simulated meetings recorded using real acoustics via loudspeaker playback. The proposed model performs better than previously proposed end-to-end diarization models on these data.
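The permutation-invariant cross-entropy idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' loss: it assumes a fixed number of output tracks, pads missing speakers with all-zero reference tracks as one simple way to cover a variable speaker count, and searches all permutations by brute force. The names `bce` and `pit_bce` are hypothetical.

```python
import itertools
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pit_bce(probs, labels):
    """Return (lowest mean BCE, permutation that achieves it).

    probs:  S output tracks, each a list of T speech-activity probabilities.
    labels: S reference tracks, each a list of T 0/1 labels. Assumption:
            when fewer than S speakers are present, the remaining
            reference tracks are all zeros.
    perm[i] is the reference track assigned to output track i.
    """
    n_tracks = len(probs)
    n_frames = len(probs[0])
    best_loss, best_perm = float("inf"), None
    # Brute force over all S! assignments; only practical for small S.
    for perm in itertools.permutations(range(n_tracks)):
        total = 0.0
        for out_idx, ref_idx in enumerate(perm):
            for p, y in zip(probs[out_idx], labels[ref_idx]):
                total += bce(p, y)
        loss = total / (n_tracks * n_frames)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: three output tracks, two active speakers, one padded track.
example_probs = [[0.1, 0.1, 0.1], [0.9, 0.9, 0.1], [0.1, 0.9, 0.9]]
example_labels = [[1, 1, 0], [0, 1, 1], [0, 0, 0]]
example_loss, example_perm = pit_bce(example_probs, example_labels)
```

In the toy example the second and third reference speakers overlap in the middle frame, which a per-track sigmoid output can represent directly. The loss does not depend on the order in which the reference tracks are listed.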
Pages: 7183-7187
Page count: 5