RAttSR: A Novel Low-Cost Reconstructed Attention-Based End-to-End Speech Recognizer

Authors
Bachchu Paul
Santanu Phadikar
Affiliations
[1] Vidyasagar University, Department of Computer Science
[2] Maulana Abul Kalam Azad University of Technology, Department of Computer Science and Engineering
[3] West Bengal
Keywords
Automatic speech recognition; Mel-spectrogram; Convolutional neural network; Long short-term memory; Attention model
Abstract
Voice commands are attracting growing interest as the next generation of interaction and are expected to play a dominant role in communicating with smart devices. However, language remains a significant barrier to the widespread use of these devices. Even existing models for widely spoken languages require a large number of parameters, resulting in high computational cost. The main drawback of the latest advanced models is that they cannot run on devices with constrained resources. This paper proposes a novel end-to-end speech recognizer based on a low-cost Bidirectional Long Short-Term Memory (BiLSTM) attention model. The mel-spectrogram of each speech signal is generated and fed into the proposed neural attention model to classify isolated words. The model consists of three convolution layers followed by two BiLSTM layers that encode a vector of length 64, which is used to compute attention over the input sequence. The convolution layers characterize the relationships among the energy bins in the spectrogram. The BiLSTM network handles long-term dependencies in the input sequence, and the attention block finds the most significant region of the input sequence, reducing the computational cost of classification. The vector encoded by the attention head is fed to a three-layer fully connected network for recognition. The model has only 133K parameters, fewer than several current state-of-the-art models for isolated word recognition. Two datasets are used in this study: the Speech Command Dataset (SCD) and a self-made dataset of fifteen color names spoken in Bengali. With the proposed technique, validation and test accuracy on the Bengali color dataset reach 98.82% and 98.95%, respectively, outperforming the current state-of-the-art models in both accuracy and model size. When the same network is trained on the SCD, the average test accuracy is 96.95%. To support the proposed model, its results are compared with those of recent state-of-the-art models, and the comparison shows the superiority of the proposed model.
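The attention step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state the scoring function, so dot-product scoring is an assumption here, and the names attention_pool, states, and query are hypothetical. The sketch shows how a sequence of encoder outputs is reduced to a single context vector by weighting each time step with a softmax over its score against a query vector.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, query):
    """Reduce a sequence of vectors to one context vector.

    states: list of T encoder outputs, each a list of D floats
            (stand-ins for BiLSTM outputs; hypothetical input).
    query:  list of D floats (stand-in for the encoded query vector).
    Scoring is a plain dot product, which is an assumption.
    """
    # One scalar score per time step.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the states gives the context vector.
    dim = len(query)
    context = [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    states = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
    query = [1.0, 1.0]
    weights, context = attention_pool(states, query)
    print("weights:", weights)
    print("context:", context)
```

In this toy input the second time step aligns best with the query, so it receives the largest weight and dominates the context vector. In the model described above, the resulting vector would then be passed to the fully connected classifier.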
Pages: 2454-2476
Page count: 22