E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition

Cited: 20
Authors
Almadhor, Ahmad [1 ]
Irfan, Rizwana [2 ]
Gao, Jiechao [3 ]
Saleem, Nasir [4 ]
Rauf, Hafiz Tayyab [5 ]
Kadry, Seifedine [6 ,7 ,8 ]
Affiliations
[1] Jouf Univ, Coll Comp & Informat Sci, Dept Comp Engn & Networks, Sakakah, Saudi Arabia
[2] Univ Jeddah, Coll Comp & Informat Technol Khulais, Dept Informat Technol, Jeddah 21959, Saudi Arabia
[3] Univ Virginia, Dept Comp Sci, Charlottesville, VA 22904 USA
[4] Gomal Univ, Dept Elect Engn, FET, Dera Ismail Khan, Pakistan
[5] Staffordshire Univ, Ctr Smart Syst AI & Cybersecur, Stoke On Trent ST4 2DE, England
[6] Noroff Univ Coll, Dept Appl Data Sci, N-4612 Kristiansand, Norway
[7] Ajman Univ, Artificial Intelligence Res Ctr AIRC, POB 346, Ajman, U Arab Emirates
[8] Lebanese Amer Univ, Dept Elect & Comp Engn, POB 13, Byblos 5053, Lebanon
Keywords
Dysarthria; Dysarthric ASR; Speech intelligibility; Words error; Multi-head transformer; CNN; FEATURES; SYSTEM;
DOI
10.1016/j.eswa.2023.119797
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dysarthria is a motor speech disability caused by weakness of the muscles and organs involved in articulation, which reduces an individual's speech intelligibility. Because the condition is linked to physical exhaustion and disability, affected individuals struggle not only to communicate but also to interact with digital devices. Automatic speech recognition (ASR) can make an important difference for individuals with dysarthria, since modern digital devices offer an interaction medium that connects them with their community and with computers. Still, ASR technologies perform poorly on dysarthric speech, particularly for acute dysarthria. Dysarthric ASR technologies face multiple challenges, including dysarthric phoneme inaccuracy and labeling imperfection. This paper proposes a spatio-temporal dysarthric ASR (DASR) system that uses a Spatial Convolutional Neural Network (SCNN) and a Multi-Head Attention Transformer (MHAT) to extract speech features visually, so that the DASR learns the shapes of phonemes pronounced by dysarthric individuals. This visual feature modeling eliminates the phoneme-related challenges. The UA-Speech database, which includes speakers at different speech intelligibility levels, is used throughout. However, because the proportion of usable speech data to the number of distinct classes in the UA-Speech database is small, the proposed DASR system leverages transfer learning to generate synthetic visuals. In benchmarking against the other DASRs examined in this study, the proposed system improved recognition accuracy by 20.72% on the UA-Speech database, with the largest improvements achieved for very-low (25.75%) and low (33.67%) intelligibility speech.
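The abstract's pipeline (visual speech features produced by a CNN, then attended over by a multi-head transformer) can be sketched in miniature. The sketch below shows only the multi-head self-attention step over a sequence of frame-level feature vectors; the dimensions, head count, and random weights are illustrative assumptions, not the paper's actual SCNN/MHAT architecture:

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Toy multi-head self-attention over a feature sequence.
    x: (seq_len, d_model) array, e.g. CNN features per spectrogram frame.
    Projection weights are random here, purely for illustration."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    # Random query/key/value projections (trained in a real model).
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    # Split the model dimension into per-head slices: (heads, seq, d_head).
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, softmax over the key axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                     # (heads, seq, d_head)
    # Re-concatenate the heads back into the model dimension.
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 64))  # 50 frames of 64-dim CNN features
attended = multi_head_attention(frames, num_heads=8, rng=rng)
print(attended.shape)  # (50, 64)
```

In the paper's setting, a classifier head over such attended features would predict the word or phoneme class; here the point is only the shape flow from frame features to attention output.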
Pages: 12
Related papers
50 records total
  • [1] Speech Vision: An End-to-End Deep Learning-Based Dysarthric Automatic Speech Recognition System
    Shahamiri, Seyed Reza
    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2021, 29 : 852 - 861
  • [2] End-to-End Automatic Speech Recognition with Deep Mutual Learning
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Tanaka, Tomohiro
    Ashihara, Takanori
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 632 - 637
  • [3] INCREMENTAL LEARNING FOR END-TO-END AUTOMATIC SPEECH RECOGNITION
    Fu, Li
    Li, Xiaoxiao
    Zi, Libo
    Zhang, Zhengchen
    Wu, Youzheng
    He, Xiaodong
    Zhou, Bowen
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 320 - 327
  • [4] E2E-SINCNET: TOWARD FULLY END-TO-END SPEECH RECOGNITION
    Parcollet, Titouan
    Morchid, Mohamed
    Linares, Georges
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7714 - 7718
  • [5] END-TO-END DYSARTHRIC SPEECH RECOGNITION USING MULTIPLE DATABASES
    Takashima, Yuki
    Takiguchi, Tetsuya
    Ariki, Yasuo
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6395 - 6399
  • [6] Continual Learning for Monolingual End-to-End Automatic Speech Recognition
    Vander Eeckt, Steven
    Van Hamme, Hugo
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 459 - 463
  • [7] Arabic speech recognition using end-to-end deep learning
    Alsayadi, Hamzah A.
    Abdelhamid, Abdelaziz A.
    Hegazy, Islam
    Fayed, Zaki T.
    IET SIGNAL PROCESSING, 2021, 15 (08) : 521 - 534
  • [8] An Overview of End-to-End Automatic Speech Recognition
    Wang, Dong
    Wang, Xiaodong
    Lv, Shaohe
    SYMMETRY-BASEL, 2019, 11 (08):
  • [9] LEARNING A SUBWORD INVENTORY JOINTLY WITH END-TO-END AUTOMATIC SPEECH RECOGNITION
    Drexler, Jennifer
    Glass, James
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6439 - 6443
  • [10] Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition
    Sustek, Martin
    Sadhu, Samik
    Hermansky, Hynek
    INTERSPEECH 2022, 2022, : 1046 - 1050