Attention based end to end Speech Recognition for Voice Search in Hindi and English

Cited by: 4
Authors
Joshi, Raviraj [1]
Kannan, Venkateshan [1]
Affiliations
[1] Flipkart, Bengaluru, India
Keywords
automatic speech recognition; encoder-decoder models; attention; listen attend spell; SYSTEM; CHALLENGE;
DOI
10.1145/3503162.3503173
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme-based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system on the Flipkart Voice Search dataset. The paper also provides an overview of the different components that can be tuned in a LAS-based system.
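The abstract builds on the Listen-Attend-Spell (LAS) encoder-decoder. As a rough reference point, below is a minimal PyTorch sketch of an LAS-style model: a pyramidal BiLSTM "Listener" encoder that halves the time resolution at each layer, and an attention-based LSTM "Speller" decoder. All layer sizes, the feature dimension, the vocabulary size, and the dot-product attention are illustrative assumptions made for this sketch; they are not the configuration, attention variant, or multi-objective training setup reported in the paper.

# Minimal sketch of a Listen-Attend-Spell (LAS) style model in PyTorch.
# Dimensions, layer counts, and vocabulary size are illustrative assumptions,
# not the settings used in the paper.
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Pyramidal BiLSTM encoder: halves the time resolution at each layer above the first."""
    def __init__(self, input_dim=80, hidden_dim=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # 2 (bidirectional) * 2 (concatenated consecutive frames) = hidden_dim * 4
            in_dim = input_dim if i == 0 else hidden_dim * 4
            self.layers.append(nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True))

    def forward(self, x):  # x: (batch, time, input_dim)
        for i, lstm in enumerate(self.layers):
            if i > 0:
                # Pyramidal reduction: concatenate every pair of consecutive frames.
                b, t, d = x.shape
                if t % 2:  # drop the last frame if the length is odd
                    x = x[:, :-1, :]
                    t -= 1
                x = x.reshape(b, t // 2, d * 2)
            x, _ = lstm(x)
        return x  # (batch, reduced_time, hidden_dim * 2)

class Speller(nn.Module):
    """Attention-based LSTM decoder that emits one output token per step (teacher forcing)."""
    def __init__(self, vocab_size=5000, hidden_dim=256, enc_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.LSTMCell(hidden_dim + enc_dim, hidden_dim)
        self.attn_query = nn.Linear(hidden_dim, enc_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, enc, tokens):  # enc: (B, T, enc_dim); tokens: (B, U)
        b = enc.size(0)
        h = enc.new_zeros(b, self.rnn.hidden_size)
        c = enc.new_zeros(b, self.rnn.hidden_size)
        context = enc.new_zeros(b, enc.size(2))
        logits = []
        for u in range(tokens.size(1)):
            emb = self.embed(tokens[:, u])
            h, c = self.rnn(torch.cat([emb, context], dim=-1), (h, c))
            # Dot-product attention over the encoder states.
            scores = torch.bmm(enc, self.attn_query(h).unsqueeze(-1)).squeeze(-1)  # (B, T)
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (B, enc_dim)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)  # (B, U, vocab_size)

# Usage example: 80-dim log-mel features, teacher-forced decoding of a token sequence.
feats = torch.randn(2, 200, 80)
tokens = torch.randint(0, 5000, (2, 12))
encoder, decoder = Listener(), Speller()
logits = decoder(encoder(feats), tokens)
print(logits.shape)  # torch.Size([2, 12, 5000])

This sketch only covers the basic listen-attend-spell loop; the components the paper discusses tuning (attention mechanism, auxiliary objectives, multi-pass training, external language-model and phoneme-based rescoring) would sit on top of a skeleton like this.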
Pages: 107 - 113
Number of pages: 7