CONFIDENCE ESTIMATION FOR ATTENTION-BASED SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION

Cited by: 24
Authors
Li, Qiujia [1 ,3 ]
Qiu, David [2 ]
Zhang, Yu [2 ]
Li, Bo [2 ]
He, Yanzhang [2 ]
Woodland, Philip C. [1 ]
Cao, Liangliang [2 ]
Strohman, Trevor [2 ]
Affiliations
[1] Univ Cambridge, Cambridge, England
[2] Google LLC, Mountain View, CA 94043 USA
[3] Google, Mountain View, CA 94043 USA
Keywords
confidence scores; end-to-end ASR;
DOI
10.1109/ICASSP39728.2021.9414920
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
For various speech-related tasks, confidence scores from a speech recogniser are a useful measure to assess the quality of transcriptions. In traditional hidden Markov model-based automatic speech recognition (ASR) systems, confidence scores can be reliably obtained from word posteriors in decoding lattices. However, for an ASR system with an auto-regressive decoder, such as an attention-based sequence-to-sequence model, computing word posteriors is difficult. An obvious alternative is to use the decoder softmax probability as the model confidence. In this paper, we first examine how some commonly used regularisation methods influence the softmax-based confidence scores and study the overconfident behaviour of end-to-end models. Then we propose a lightweight and effective approach named confidence estimation module (CEM) on top of an existing end-to-end ASR model. Experiments on LibriSpeech show that CEM can mitigate the overconfidence problem and can produce more reliable confidence scores with and without shallow fusion of a language model. Further analysis shows that CEM generalises well to speech from a moderately mismatched domain and can potentially improve downstream tasks such as semi-supervised learning.
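The softmax-based baseline mentioned in the abstract can be sketched in a few lines: the confidence of each greedily decoded token is taken as the maximum softmax probability at that decoder step, and token scores are aggregated into an utterance-level score. This is a minimal illustration of the baseline only, not the paper's CEM; the function names and the arithmetic-mean aggregation are this sketch's own choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one step's output logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_confidence(step_logits):
    """Per-token confidence of the greedy hypothesis: the max softmax
    probability at each decoder step (a common, often overconfident,
    baseline for auto-regressive decoders)."""
    return [max(softmax(step)) for step in step_logits]

def utterance_confidence(token_confidences):
    """Aggregate token-level scores into one utterance-level score;
    the arithmetic mean is one simple choice."""
    return sum(token_confidences) / len(token_confidences)

# Two decoder steps: the second distribution is more peaked,
# so its max-probability confidence is higher.
scores = softmax_confidence([[2.0, 0.5, 0.1], [3.0, 0.2, 0.2]])
overall = utterance_confidence(scores)
```

The paper's CEM replaces this raw probability with the output of a small trained module that predicts per-token correctness, which mitigates the overconfidence that this baseline exhibits.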
Pages: 6388-6392
Page count: 5
Related papers (50 in total)
  • [31] CORRECTION OF AUTOMATIC SPEECH RECOGNITION WITH TRANSFORMER SEQUENCE-TO-SEQUENCE MODEL
    Hrinchuk, Oleksii
    Popova, Mariya
    Ginsburg, Boris
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7074 - 7078
  • [32] ACOUSTIC-TO-WORD RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS
    Palaskar, Shruti
    Metze, Florian
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 397 - 404
  • [33] Sequence-to-Sequence Models Can Directly Translate Foreign Speech
    Weiss, Ron J.
    Chorowski, Jan
    Jaitly, Navdeep
    Wu, Yonghui
    Chen, Zhifeng
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2625 - 2629
  • [34] SPEECH-TRANSFORMER: A NO-RECURRENCE SEQUENCE-TO-SEQUENCE MODEL FOR SPEECH RECOGNITION
    Dong, Linhao
    Xu, Shuang
    Xu, Bo
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5884 - 5888
  • [35] Detection and analysis of attention errors in sequence-to-sequence text-to-speech
    Valentini-Botinhao, Cassia
    King, Simon
    INTERSPEECH 2021, 2021, : 2746 - 2750
  • [36] Guiding Attention in Sequence-to-Sequence Models for Dialogue Act prediction
    Colombo, Pierre
    Chapuis, Emile
    Manica, Matteo
    Vignon, Emmanuel
    Varni, Giovanna
    Clavel, Chloe
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 7594 - 7601
  • [37] Automatic Pronunciation Generator for Indonesian Speech Recognition System Based on Sequence-to-Sequence Model
    Hoesen, Devin
    Putri, Fanda Yuliana
    Lestari, Dessi Puji
    2019 22ND CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA), 2019, : 7 - 12
  • [38] Dual Attention-Based Encoder-Decoder: A Customized Sequence-to-Sequence Learning for Soft Sensor Development
    Feng, Liangjun
    Zhao, Chunhui
    Sun, Youxian
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (08) : 3306 - 3317
  • [39] Enhanced Sequence-to-Sequence Attention-Based PM2.5 Concentration Forecasting Using Spatiotemporal Data
    Kim, Baekcheon
    Kim, Eunkyeong
    Jung, Seunghwan
    Kim, Minseok
    Kim, Jinyong
    Kim, Sungshin
    ATMOSPHERE, 2024, 15 (12)
  • [40] Towards Understanding Attention-Based Speech Recognition Models
    Qin, Chu-Xiong
    Qu, Dan
    IEEE ACCESS, 2020, 8 : 24358 - 24369