Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Cited by: 22
|
Authors
Zhang, Wangyou [1 ,2 ]
Chang, Xuankai [3 ]
Qian, Yanmin [1 ,2 ]
Watanabe, Shinji [3 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, AI Inst, Shanghai 200240, Peoples R China
[3] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
Speech recognition; Training; Hidden Markov models; Decoding; Speech enhancement; Computational modeling; Multi-talker mixed speech recognition; permutation invariant training; end-to-end model; knowledge distillation; curriculum learning; DEEP NEURAL-NETWORKS; SEPARATION; ENHANCEMENT; ARCHITECTURES; COMPRESSION; PROGRESS;
DOI
10.1109/TASLP.2020.2988423
CLC Number (Chinese Library Classification)
O42 [Acoustics];
Discipline Code
070206 ; 082403 ;
Abstract
Although significant progress has been made in single-talker automatic speech recognition (ASR), there is still a large performance gap between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy to recognize overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance system performance, including speaker parallel attention, scheduled sampling, curriculum learning and knowledge distillation. More specifically, speaker parallel attention extends the basic single shared attention module into a separate attention module per speaker, which strengthens the model's speaker tracing and separation ability. Scheduled sampling and curriculum learning are then applied to improve model optimization. Finally, knowledge distillation transfers knowledge from an original single-speaker model to the current multi-speaker model in the proposed end-to-end multi-talker ASR structure. Our proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 read-speech corpus. The experiments demonstrate that our proposed architectures significantly improve multi-talker mixed speech recognition. The final system obtains more than 15% relative performance gains in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
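The core of permutation invariant training (PIT), as described in the abstract, is that the assignment of model output streams to reference speakers is unknown, so the training loss is taken as the minimum over all output-to-reference permutations. A minimal sketch of that utterance-level criterion, assuming a precomputed pairwise loss matrix (names and shapes are illustrative, not the paper's implementation):

```python
# Utterance-level PIT loss sketch.
# pairwise_loss[i][j] is assumed to be the loss of model output stream i
# scored against reference transcript j (e.g. a cross-entropy or CTC loss).
from itertools import permutations

def pit_loss(pairwise_loss):
    """Return (minimum total loss, best output-to-reference assignment)."""
    n = len(pairwise_loss)
    best, best_perm = float("inf"), None
    # Enumerate every assignment of output streams to references;
    # feasible for the 2-3 speakers typical of multi-talker ASR.
    for perm in permutations(range(n)):
        total = sum(pairwise_loss[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best, best_perm

# Example: two output streams, two reference speakers.
losses = [[0.9, 0.2],
          [0.3, 1.1]]
loss, assignment = pit_loss(losses)
# assignment == (1, 0): output 0 is matched to speaker 1, output 1 to speaker 0
```

Brute-force enumeration costs n! permutations; for larger speaker counts the same minimization can be solved in polynomial time with the Hungarian algorithm, but exhaustive search is standard for the two-speaker mixtures used here.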
Pages: 1385-1394
Page count: 10
Related Papers
50 records
  • [1] END-TO-END MULTI-TALKER OVERLAPPING SPEECH RECOGNITION
    Tripathi, Anshuman
    Lu, Han
    Sak, Hasim
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2020, : 6129 - 6133
  • [2] Streaming End-to-End Multi-Talker Speech Recognition
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 803 - 807
  • [3] Unsupervised Domain Adaptation on End-to-End Multi-Talker Overlapped Speech Recognition
    Zheng, Lin
    Zhu, Han
    Tian, Sanli
    Zhao, Qingwei
    Li, Ta
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 3119 - 3123
  • [4] Single-channel multi-talker speech recognition with permutation invariant training
    Qian, Yanmin
    Chang, Xuankai
    Yu, Dong
    SPEECH COMMUNICATION, 2018, 104 : 1 - 11
  • [5] Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition
    Weng, Chao
    Yu, Dong
    Seltzer, Michael L.
    Droppo, Jasha
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (10) : 1670 - 1679
  • [6] Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
    An, Keyu
    Xiao, Ji
    Ou, Zhijian
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 180 - 184
  • [7] End-to-End Brain-Driven Speech Enhancement in Multi-Talker Conditions
    Hosseini, Maryam
    Celotti, Luca
    Plourde, Eric
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1718 - 1733
  • [8] KNOWLEDGE TRANSFER IN PERMUTATION INVARIANT TRAINING FOR SINGLE-CHANNEL MULTI-TALKER SPEECH RECOGNITION
    Tan, Tian
    Qian, Yanmin
    Yu, Dong
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5714 - 5718
  • [9] ENDPOINT DETECTION FOR STREAMING END-TO-END MULTI-TALKER ASR
    Lu, Liang
    Li, Jinyu
    Gong, Yifan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7312 - 7316
  • [10] Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction
    Makishima, Naoki
    Suzuki, Keita
    Suzuki, Satoshi
    Ando, Atsushi
    Masumura, Ryo
    INTERSPEECH 2023, 2023, : 2913 - 2917