E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition

Cited by: 20
Authors
Almadhor, Ahmad [1]
Irfan, Rizwana [2]
Gao, Jiechao [3]
Saleem, Nasir [4]
Rauf, Hafiz Tayyab [5]
Kadry, Seifedine [6,7,8]
Affiliations
[1] Jouf Univ, Coll Comp & Informat Sci, Dept Comp Engn & Networks, Sakakah, Saudi Arabia
[2] Univ Jeddah, Coll Comp & Informat Technol Khulais, Dept Informat Technol, Jeddah 21959, Saudi Arabia
[3] Univ Virginia, Dept Comp Sci, Charlottesville, VA 22904 USA
[4] Gomal Univ, Dept Elect Engn, FET, Dera Ismail Khan, Pakistan
[5] Staffordshire Univ, Ctr Smart Syst AI & Cybersecur, Stoke On Trent ST4 2DE, England
[6] Noroff Univ Coll, Dept Appl Data Sci, N-4612 Kristiansand, Norway
[7] Ajman Univ, Artificial Intelligence Res Ctr AIRC, POB 346, Ajman, U Arab Emirates
[8] Lebanese Amer Univ, Dept Elect & Comp Engn, POB 13, Byblos 5053, Lebanon
Keywords
Dysarthria; Dysarthric ASR; Speech intelligibility; Word error; Multi-head transformer; CNN; FEATURES; SYSTEM
DOI
10.1016/j.eswa.2023.119797
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Dysarthria is a motor speech disorder caused by weakness in the muscles and organs involved in articulation, which degrades the speech intelligibility of affected individuals. Because the condition is often accompanied by physical disabilities, individuals not only have communication difficulties but also struggle to interact with digital devices. Automatic speech recognition (ASR) can make an important difference for individuals with dysarthria, since modern digital devices offer an interaction medium that lets them engage with their community and with computers. Still, ASR technologies perform poorly on dysarthric speech, particularly for acute dysarthria. Dysarthric ASR faces multiple challenges, including dysarthric phoneme inaccuracy and labeling imperfections. This paper proposes a spatio-temporal dysarthric ASR (DASR) system that uses a Spatial Convolutional Neural Network (SCNN) and a Multi-Head Attention Transformer (MHAT) to extract speech features visually, so that DASR learns the shapes of phonemes pronounced by dysarthric individuals. This visual feature modeling sidesteps the phoneme-related challenges. The UA-Speech database, which includes speakers at different speech-intelligibility levels, is used throughout. However, because the ratio of usable speech data to the number of distinct classes in UA-Speech is small, the proposed DASR system leverages transfer learning to generate synthetic visual features. Benchmarked against the other DASR systems examined in this study, the proposed system improved recognition accuracy by 20.72% on the UA-Speech database, with the largest improvements achieved for very-low (25.75%) and low (33.67%) intelligibility speech.
Pages: 12
Related papers
50 records in total
  • [31] Deep Learning-Based Acoustic Feature Representations for Dysarthric Speech Recognition
    Latha M.
    Shivakumar M.
    Manjula G.
    Hemakumar M.
    Kumar M.K.
    SN Computer Science, 4 (3)
  • [32] Hardware Accelerator for Transformer based End-to-End Automatic Speech Recognition System
    Yamini, Shaarada D.
    Mirishkar, Ganesh S.
    Vuppala, Anil Kumar
    Purini, Suresh
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023, : 93 - 100
  • [33] AUDITORY-BASED DATA AUGMENTATION FOR END-TO-END AUTOMATIC SPEECH RECOGNITION
    Tu, Zehai
    Deadman, Jack
    Ma, Ning
    Barker, Jon
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7447 - 7451
  • [34] Spectrograms Fusion-based End-to-end Robust Automatic Speech Recognition
    Shi, Hao
    Wang, Longbiao
    Li, Sheng
    Fan, Cunhang
    Dang, Jianwu
    Kawahara, Tatsuya
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 438 - 442
  • [35] An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters
    Hadwan, Mohammed
    Alsayadi, Hamzah A.
    AL-Hagree, Salah
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (02): : 3471 - 3487
  • [36] E2EAI: End-to-End Deep Learning Framework for Active Investing
    Wei, Zikai
    Dai, Bo
    Lin, Dahua
    PROCEEDINGS OF THE 4TH ACM INTERNATIONAL CONFERENCE ON AI IN FINANCE, ICAIF 2023, 2023, : 55 - 63
  • [37] STRUCTURED SPARSE ATTENTION FOR END-TO-END AUTOMATIC SPEECH RECOGNITION
    Xue, Jiabin
    Zheng, Tieran
    Han, Jiqing
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2020, : 7044 - 7048
  • [38] The Processing of Stress in End-to-End Automatic Speech Recognition Models
    Bentum, Martijn
    ten Bosch, Louis
    Lentz, Tom
    INTERSPEECH 2024, 2024, : 2350 - 2354
  • [39] Deep Learning-Based End-to-End Speaker Identification Using Time–Frequency Representation of Speech Signal
    Banala Saritha
    Mohammad Azharuddin Laskar
    Anish Monsley Kirupakaran
    Rabul Hussain Laskar
    Madhuchhanda Choudhury
    Nirupam Shome
    Circuits, Systems, and Signal Processing, 2024, 43 : 1839 - 1861
  • [40] IMPROVING END-TO-END SPEECH RECOGNITION WITH POLICY LEARNING
    Zhou, Yingbo
    Xiong, Caiming
    Socher, Richard
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5819 - 5823