Attention based end to end Speech Recognition for Voice Search in Hindi and English

Cited by: 4
Authors
Joshi, Raviraj [1 ]
Kannan, Venkateshan [1 ]
Affiliations
[1] Flipkart, Bengaluru, India
Keywords
automatic speech recognition; encoder-decoder models; attention; listen attend spell; SYSTEM; CHALLENGE;
DOI
10.1145/3503162.3503173
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We describe here our work with automatic speech recognition (ASR) in the context of voice search functionality on the Flipkart e-Commerce platform. Starting with the deep learning architecture of Listen-Attend-Spell (LAS), we build upon and expand the model design and attention mechanisms to incorporate innovative approaches including multi-objective training, multi-pass training, and external rescoring using language models and phoneme based losses. We report a relative WER improvement of 15.7% on top of state-of-the-art LAS models using these modifications. Overall, we report an improvement of 36.9% over the phoneme-CTC system on the Flipkart Voice Search dataset. The paper also provides an overview of different components that can be tuned in a LAS based system.
Pages: 107 - 113
Page count: 7
Related papers
50 in total
  • [31] STREAM ATTENTION-BASED MULTI-ARRAY END-TO-END SPEECH RECOGNITION
    Wang, Xiaofei
    Li, Ruizhi
    Mallidi, Sri Harish
    Hori, Takaaki
    Watanabe, Shinji
    Hermansky, Hynek
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7105 - 7109
  • [32] DEEP ENCODED LINGUISTIC AND ACOUSTIC CUES FOR ATTENTION BASED END TO END SPEECH EMOTION RECOGNITION
    Bhosale, Swapnil
    Chakraborty, Rupayan
    Kopparapu, Sunil Kumar
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7189 - 7193
  • [33] End-to-end Named Entity Recognition from English Speech
    Yadav, Hemant
    Ghosh, Sreyan
    Yu, Yi
    Shah, Rajiv Ratn
    INTERSPEECH 2020, 2020, : 4268 - 4272
  • [34] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINT CTC-ATTENTION BASED MODELS
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 936 - 943
  • [35] END-TO-END AUTOMATIC SPEECH RECOGNITION INTEGRATED WITH CTC-BASED VOICE ACTIVITY DETECTION
    Yoshimura, Takenori
    Hayashi, Tomoki
    Takeda, Kazuya
    Watanabe, Shinji
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6999 - 7003
  • [36] Online Hybrid CTC/Attention Architecture for End-to-end Speech Recognition
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Li, Ta
    Yan, Yonghong
    INTERSPEECH 2019, 2019, : 2623 - 2627
  • [37] Noise-robust Attention Learning for End-to-End Speech Recognition
    Higuchi, Yosuke
    Tawara, Naohiro
    Ogawa, Atsunori
    Iwata, Tomoharu
    Kobayashi, Tetsunori
    Ogawa, Tetsuji
    28TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2020), 2021, : 311 - 315
  • [38] MODALITY ATTENTION FOR END-TO-END AUDIO-VISUAL SPEECH RECOGNITION
    Zhou, Pan
    Yang, Wenwen
    Chen, Wei
    Wang, Yanfeng
    Jia, Jia
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6565 - 6569
  • [39] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [40] A hybrid CTC+Attention model based on end-to-end framework for multilingual speech recognition
    Sendong Liang
    Wei Qi Yan
    Multimedia Tools and Applications, 2022, 81 : 41295 - 41308