Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Cited by: 0

Authors:

Makishima, Naoki [1]

Suzuki, Keita [1]

Suzuki, Satoshi [1]

Ando, Atsushi [1]

Masumura, Ryo [1]

Affiliations:

[1] NTT Corp, NTT Comp & Data Sci Labs, Tokyo, Japan

Source:

INTERSPEECH 2023 | 2023

Keywords:

multi-talker automatic speech recognition; timestamp prediction; autoregressive modeling; SEPARATION;

DOI:

10.21437/Interspeech.2023-564

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206; 082403;

Abstract:

This paper proposes joint autoregressive modeling of multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach. However, it does not predict utterance timestamps, even though they are important in practice. To address this problem, our key idea is to extend autoregressive-modeling-based multi-talker ASR to predict quantized timestamp tokens that represent the start and end times of each utterance. Our method estimates the transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamp prediction without changing the simple autoregressive modeling of conventional multi-talker ASR. Experimental results show that our method achieves better ASR performance than conventional autoregressive multi-talker ASR without timestamp prediction, and achieves promising timestamp prediction accuracy.
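As a rough illustration of the idea in the abstract, the sketch below shows one way a target sequence could interleave quantized start and end timestamp tokens with the words of each utterance, one utterance after another. It is not the authors' implementation. The 0.1 s quantization step, the `<t_N>` token format, the `<sc>` separator token, and the ordering of utterances by start time are all assumptions made for this example; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List

RESOLUTION_S = 0.1  # hypothetical quantization step in seconds (assumption)
SEP = "<sc>"        # hypothetical separator between utterances (assumption)


@dataclass
class Utterance:
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds
    text: str     # transcription, whitespace-separated words


def time_to_token(t: float, resolution: float = RESOLUTION_S) -> str:
    """Quantize a time in seconds to a discrete token such as '<t_23>'."""
    return f"<t_{int(round(t / resolution))}>"


def token_to_time(token: str, resolution: float = RESOLUTION_S) -> float:
    """Invert time_to_token, up to the quantization error."""
    return int(token[3:-1]) * resolution


def serialize(utterances: List[Utterance]) -> List[str]:
    """Build one token sequence: start token, words, end token per utterance."""
    tokens: List[str] = []
    # Ordering by start time is an assumption of this sketch.
    for i, utt in enumerate(sorted(utterances, key=lambda u: u.start)):
        if i > 0:
            tokens.append(SEP)
        tokens.append(time_to_token(utt.start))
        tokens.extend(utt.text.split())
        tokens.append(time_to_token(utt.end))
    return tokens


def deserialize(tokens: List[str]) -> List[Utterance]:
    """Recover utterances and their quantized timestamps from a sequence."""
    utterances: List[Utterance] = []
    chunk: List[str] = []
    for tok in tokens + [SEP]:  # trailing separator flushes the last chunk
        if tok == SEP:
            if chunk:
                utterances.append(
                    Utterance(
                        start=token_to_time(chunk[0]),
                        end=token_to_time(chunk[-1]),
                        text=" ".join(chunk[1:-1]),
                    )
                )
            chunk = []
        else:
            chunk.append(tok)
    return utterances
```

With a format like this, an autoregressive decoder would emit timestamp tokens from the same vocabulary as word tokens, which is consistent with the abstract's claim that the conventional autoregressive formulation does not need to change.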


Pages: 2913-2917

Page count: 5
