Streaming Multi-talker Speech Recognition with Joint Speaker Identification

Cited by: 10
Authors
Lu, Liang [1]
Kanda, Naoyuki [1]
Li, Jinyu [1]
Gong, Yifan [1]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
Keywords
Overlapped speech recognition; Streaming; Unmixing transducer; Joint recognition and identification; OVERLAPPED SPEECH;
DOI
10.21437/Interspeech.2021-207
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification codes
100104; 100213;
Abstract
In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications. Since overlapped speech is common in this case, conventional approaches usually address this problem in a cascaded fashion that involves speech separation, speech recognition and speaker identification that are trained independently. In this paper, we propose Streaming Unmixing, Recognition and Identification Transducer (SURIT) - a new framework that deals with this problem in an end-to-end streaming fashion. SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on the LibrispeechMix dataset - a multi-talker dataset derived from Librispeech, and present encouraging results.
Pages: 1782 - 1786
Number of pages: 5
Related papers
50 in total
  • [1] Streaming End-to-End Multi-Talker Speech Recognition
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 803 - 807
  • [2] Speaker Identification in Multi-Talker Overlapping Speech Using Neural Networks
    Tran, Van-Thuan
    Tsai, Wei-Ho
    IEEE ACCESS, 2020, 8 : 134868 - 134879
  • [3] Modeling speech localization, talker identification, and word recognition in a multi-talker setting
    Josupeit, Angela
    Hohmann, Volker
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 142 (01) : 35 - 54
  • [4] STREAMING NOISE CONTEXT AWARE ENHANCEMENT FOR AUTOMATIC SPEECH RECOGNITION IN MULTI-TALKER ENVIRONMENTS
    Caroselli, Joe
    Narayanan, Arun
    Huang, Yiteng
    2022 INTERNATIONAL WORKSHOP ON ACOUSTIC SIGNAL ENHANCEMENT (IWAENC 2022), 2022,
  • [5] Multi-Channel Speaker Verification for Single and Multi-talker Speech
    Kataria, Saurabh
    Zhang, Shi-Xiong
    Yu, Dong
    INTERSPEECH 2021, 2021, : 4608 - 4612
  • [6] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rung
    Lee, Chin-Hui
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [7] Selective cortical representation of attended speaker in multi-talker speech perception
    Mesgarani, Nima
    Chang, Edward F.
    NATURE, 2012, 485 (7397) : 233 - U118
  • [8] Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model
    Kocour, Martin
    Zmolikova, Katerina
    Ondel, Lucas
    Svec, Jan
    Delcroix, Marc
    Ochiai, Tsubasa
    Burget, Lukas
    Cernocky, Jan Honza
    INTERSPEECH 2022, 2022, : 4955 - 4959
  • [9] Selective cortical representation of attended speaker in multi-talker speech perception
    Nima Mesgarani
    Edward F. Chang
    Nature, 2012, 485 : 233 - 236
  • [10] Target Speaker Extraction for Multi-Talker Speaker Verification
    Rao, Wei
    Xu, Chenglin
    Chng, Eng Siong
    Li, Haizhou
    INTERSPEECH 2019, 2019, : 1273 - 1277