End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11
Authors
Kanda, Naoyuki [1]
Ye, Guoli [1]
Gaur, Yashesh [1]
Wang, Xiaofei [1]
Meng, Zhong [1]
Chen, Zhuo [1]
Yoshioka, Takuya [1]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
INTERSPEECH 2021
Keywords
multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; speech recognition; diarization
DOI
10.21437/Interspeech.2021-101
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition, and speaker identification for monaural multi-talker audio. First, we thoroughly update the model architecture, previously built on a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Second, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, on the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.
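Note on the metric: the concatenated minimum-permutation word error rate mentioned in the abstract is commonly computed by concatenating each speaker's utterances into one word stream, then taking the speaker mapping between reference and hypothesis that minimizes the total word-level edit distance. The following is a minimal sketch under that common definition, not the scoring code used in the paper; the names `edit_distance` and `cp_wer` and the dictionary input format are assumptions made for illustration.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cp_wer(refs, hyps):
    """Illustrative cpWER.

    refs, hyps: dict mapping a speaker label to that speaker's utterances
    (strings, in time order). Labels need not match between the two dicts.
    """
    # Concatenate each speaker's utterances into a single word stream.
    ref_streams = [" ".join(utts).split() for utts in refs.values()]
    hyp_streams = [" ".join(utts).split() for utts in hyps.values()]

    # Pad with empty streams so both sides have the same speaker count;
    # a missing speaker then costs pure deletions or insertions.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]

    total_ref_words = sum(len(r) for r in ref_streams)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute-force search over speaker permutations (fine for a few speakers).
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_words
```

The brute-force permutation search grows factorially with the number of speakers, so for larger meetings an assignment solver over the pairwise edit-distance matrix would be the usual substitute.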
Pages: 4413-4417
Number of pages: 5