CONFIDENCE ESTIMATION FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION

Cited by: 24
Authors
Li, Qiujia [1 ,3 ]
Qiu, David [2 ]
Zhang, Yu [2 ]
Li, Bo [2 ]
He, Yanzhang [2 ]
Woodland, Philip C. [1 ]
Cao, Liangliang [2 ]
Strohman, Trevor [2 ]
Affiliations
[1] Univ Cambridge, Cambridge, England
[2] Google LLC, Mountain View, CA 94043 USA
[3] Google, Mountain View, CA 94043 USA
Keywords
confidence scores; end-to-end ASR
DOI
10.1109/ICASSP39728.2021.9414920
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based sequence-to-sequence model, computing word posteriors is difficult. An obvious alternative is to use the decoder softmax probability as the model confidence. In this paper, we first examine how some commonly used regularisation methods influence the softmax-based confidence scores and study the overconfident behaviour of end-to-end models. Then we propose a lightweight and effective approach named confidence estimation module (CEM) on top of an existing end-to-end ASR model. Experiments on LibriSpeech show that CEM can mitigate the overconfidence problem and can produce more reliable confidence scores with and without shallow fusion of a language model. Further analysis shows that CEM generalises well to speech from a moderately mismatched domain and can potentially improve downstream tasks such as semi-supervised learning.
Pages: 6388-6392
Page count: 5
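To make the idea described in the abstract concrete, the following is a minimal, illustrative sketch of a confidence estimation module (CEM): a small binary classifier placed on top of an existing end-to-end ASR decoder that maps per-token decoder-side features to a confidence score in [0, 1], trained against token correctness labels. The layer sizes, the choice of input features, and all names below are assumptions for illustration only; they are not the paper's exact architecture or training setup.

```python
# Illustrative CEM sketch (assumed architecture, not the paper's exact design).
import torch
import torch.nn as nn


class ConfidenceEstimationModule(nn.Module):
    """Predicts a per-token confidence in [0, 1] from decoder-side features."""

    def __init__(self, feature_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one confidence logit per decoded token
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, feature_dim), e.g. a concatenation of
        # the decoder hidden state, the attention context, and the softmax
        # probability of the emitted token (an assumed feature set).
        return torch.sigmoid(self.net(features)).squeeze(-1)


if __name__ == "__main__":
    # Training sketch: the target is 1 if the hypothesised token is correct
    # (e.g. from an edit-distance alignment against the reference), else 0.
    batch, num_tokens, feature_dim = 4, 20, 512
    cem = ConfidenceEstimationModule(feature_dim)
    features = torch.randn(batch, num_tokens, feature_dim)      # stand-in decoder features
    targets = torch.randint(0, 2, (batch, num_tokens)).float()  # stand-in correctness labels

    confidence = cem(features)  # (batch, num_tokens), values in [0, 1]
    loss = nn.functional.binary_cross_entropy(confidence, targets)
    loss.backward()
    print(f"confidence shape: {tuple(confidence.shape)}, BCE loss: {loss.item():.3f}")
```

Because the underlying ASR model is left unchanged and only a small head is trained, such a module is lightweight at both training and inference time, which matches the paper's description of the CEM as an add-on to an existing end-to-end model.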