Streaming Multi-talker Speech Recognition with Joint Speaker Identification

Cited by: 10

Authors:

Lu, Liang [1]

Kanda, Naoyuki [1]

Li, Jinyu [1]

Gong, Yifan [1]

Affiliation:

[1] Microsoft Corp, Redmond, WA 98052 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

Overlapped speech recognition; Streaming; Unmixing transducer; Joint recognition and identification; OVERLAPPED SPEECH;

DOI:

10.21437/Interspeech.2021-207

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification codes:

100104 ; 100213 ;

Abstract:

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to transcribe the audio as well as identify the speakers for downstream applications. Since overlapped speech is common in this case, conventional approaches usually address this problem in a cascaded fashion that involves speech separation, speech recognition and speaker identification that are trained independently. In this paper, we propose Streaming Unmixing, Recognition and Identification Transducer (SURIT) - a new framework that deals with this problem in an end-to-end streaming fashion. SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification. We validate our idea on the LibrispeechMix dataset - a multi-talker dataset derived from Librispeech, and present encouraging results.

Pages: 1782-1786

Page count: 5
