End-to-End Speaker-Attributed ASR with Transformer

Cited by: 11

Authors:

Kanda, Naoyuki ^{[1]}

Ye, Guoli ^{[1]}

Gaur, Yashesh ^{[1]}

Wang, Xiaofei ^{[1]}

Meng, Zhong ^{[1]}

Chen, Zhuo ^{[1]}

Yoshioka, Takuya ^{[1]}

Affiliations:

[1] Microsoft Corp, Redmond, WA 98052 USA

Source:

INTERSPEECH 2021 | 2021

Keywords:

multi-speaker speech recognition; speaker counting; speaker identification; serialized output training; speech recognition; diarization

DOI:

10.21437/Interspeech.2021-101

Chinese Library Classification (CLC) codes:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture, which was previously designed based on a long short-term memory (LSTM)-based attention encoder-decoder, by applying transformer architectures. Secondly, we propose a speaker deduplication mechanism to reduce speaker identification errors in highly overlapped regions. Experimental results on the LibriSpeechMix dataset show that the transformer-based architecture is especially good at counting the speakers and that the proposed model reduces the speaker-attributed word error rate by 47% over the LSTM-based baseline. Furthermore, for the LibriCSS dataset, which consists of real recordings of overlapped speech, the proposed model achieves concatenated minimum-permutation word error rates of 11.9% and 16.3% with and without target speaker profiles, respectively, both of which are state-of-the-art results for LibriCSS in the monaural setting.

Pages: 4413-4417

Page count: 5
