End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11
Authors
Kanda, Naoyuki [1]
Ye, Guoli [1]
Gaur, Yashesh [1]
Wang, Xiaofei [1]
Meng, Zhong [1]
Chen, Zhuo [1]
Yoshioka, Takuya [1]
Affiliation
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
Keywords
multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; speech recognition; diarization
DOI
10.21437/Interspeech.2021-101
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture, which was previously designed based on a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Secondly, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.
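The concatenated minimum-permutation word error rate quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the usual definition of the metric: each speaker's utterances are concatenated, the word-level edit distance is summed over speakers for every reference-to-hypothesis speaker mapping, and the smallest total is divided by the number of reference words. The function names `edit_distance` and `cp_wer` are hypothetical, this is not the authors' scoring tool, and the brute-force search over permutations is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(refs, hyps):
    """refs, hyps: one list of utterance strings per speaker.

    Assumes the reference contains at least one word.
    """
    # Concatenate every speaker's utterances into one word sequence.
    ref_words = [" ".join(utts).split() for utts in refs]
    hyp_words = [" ".join(utts).split() for utts in hyps]
    # Pad with empty speakers so both sides have the same speaker count.
    n = max(len(ref_words), len(hyp_words))
    ref_words += [[]] * (n - len(ref_words))
    hyp_words += [[]] * (n - len(hyp_words))
    total_ref = sum(len(r) for r in ref_words)
    # Try every speaker mapping and keep the one with the fewest errors.
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_words, perm))
        for perm in permutations(hyp_words)
    )
    return best / total_ref


refs = [["hello world"], ["good morning", "everyone"]]
hyps = [["good morning everyone"], ["hello word"]]
print(cp_wer(refs, hyps))
```

In this toy example the hypothesis speakers are listed in the opposite order from the reference; the permutation search recovers the matching order, leaving a single substitution out of five reference words.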
Pages: 4413-4417
Number of pages: 5
Related papers
50 in total
  • [31] TOWARDS FAST AND ACCURATE STREAMING END-TO-END ASR
    Li, Bo
    Chang, Shuo-yiin
    Sainath, Tara N.
    Pang, Ruoming
    He, Yanzhang
    Strohman, Trevor
    Wu, Yonghui
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6069 - 6073
  • [32] Improving Performance of End-to-End ASR on Numeric Sequences
    Peyser, Cal
    Zhang, Hao
    Sainath, Tara N.
    Wu, Zelin
    INTERSPEECH 2019, 2019, : 2185 - 2189
  • [33] INDEPENDENT LANGUAGE MODELING ARCHITECTURE FOR END-TO-END ASR
    Van Tung Pham
    Xu, Haihua
    Khassanov, Yerbolat
    Zeng, Zhiping
    Chng, Eng Siong
    Ni, Chongjia
    Ma, Bin
    Li, Haizhou
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7059 - 7063
  • [34] A BETTER AND FASTER END-TO-END MODEL FOR STREAMING ASR
    Li, Bo
    Gulati, Anmol
    Yu, Jiahui
    Sainath, Tara N.
    Chiu, Chung-Cheng
    Narayanan, Arun
    Chang, Shuo-Yiin
    Pang, Ruoming
    He, Yanzhang
    Qin, James
    Han, Wei
    Liang, Qiao
    Zhang, Yu
    Strohman, Trevor
    Wu, Yonghui
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5634 - 5638
  • [35] SPEAKER ADAPTATION FOR END-TO-END CTC MODELS
    Li, Ke
    Li, Jinyu
    Zhao, Yong
    Kumar, Kshitiz
    Gong, Yifan
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 542 - 549
  • [36] GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION
    Wan, Li
    Wang, Quan
    Papir, Alan
    Moreno, Ignacio Lopez
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4879 - 4883
  • [37] END-TO-END MULTI-TALKER AUDIO-VISUAL ASR USING AN ACTIVE SPEAKER ATTENTION MODULE
    Rose, Richard
    Siohan, Olivier
    INTERSPEECH 2022, 2022, : 2828 - 2832
  • [38] TOWARDS END-TO-END SPEAKER DIARIZATION WITH GENERALIZED NEURAL SPEAKER CLUSTERING
    Zhang, Chunlei
    Shi, Jiatong
    Weng, Chao
    Yu, Meng
    Yu, Dong
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8372 - 8376
  • [39] A COMPARATIVE STUDY OF MODULAR AND JOINT APPROACHES FOR SPEAKER-ATTRIBUTED ASR ON MONAURAL LONG-FORM AUDIO
    Kanda, Naoyuki
    Xiao, Xiong
    Wu, Jian
    Zhou, Tianyan
    Gaur, Yashesh
    Wang, Xiaofei
    Meng, Zhong
    Chen, Zhuo
    Yoshioka, Takuya
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 296 - 303
  • [40] Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
    Li, Zehan
    Miao, Haoran
    Deng, Keqi
    Cheng, Gaofeng
    Tian, Sanli
    Li, Ta
    Yan, Yonghong
    INTERSPEECH 2022, 2022, : 1671 - 1675