E2E-DASR: End-to-end deep learning-based dysarthric automatic speech recognition

Cited by: 20

Authors
Almadhor, Ahmad [1 ]
Irfan, Rizwana [2 ]
Gao, Jiechao [3 ]
Saleem, Nasir [4 ]
Rauf, Hafiz Tayyab [5 ]
Kadry, Seifedine [6 ,7 ,8 ]
Affiliations
[1] Jouf Univ, Coll Comp & Informat Sci, Dept Comp Engn & Networks, Sakakah, Saudi Arabia
[2] Univ Jeddah, Coll Comp & Informat Technol Khulais, Dept Informat Technol, Jeddah 21959, Saudi Arabia
[3] Univ Virginia, Dept Comp Sci, Charlottesville, VA 22904 USA
[4] Gomal Univ, Dept Elect Engn, FET, Dera Ismail Khan, Pakistan
[5] Staffordshire Univ, Ctr Smart Syst AI & Cybersecur, Stoke On Trent ST4 2DE, England
[6] Noroff Univ Coll, Dept Appl Data Sci, N-4612 Kristiansand, Norway
[7] Ajman Univ, Artificial Intelligence Res Ctr AIRC, POB 346, Ajman, U Arab Emirates
[8] Lebanese Amer Univ, Dept Elect & Comp Engn, POB 13, Byblos 5053, Lebanon
Keywords
Dysarthria; Dysarthric ASR; Speech intelligibility; Words error; Multi-head transformer; CNN; FEATURES; SYSTEM;
DOI
10.1016/j.eswa.2023.119797
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dysarthria is a motor speech disorder caused by weakness of the muscles and organs involved in articulation, reducing the speech intelligibility of affected individuals. Because the condition is linked to physical exhaustion and disability, individuals not only have communication difficulties but also struggle to interact with digital devices. Automatic speech recognition (ASR) can make an important difference for individuals with dysarthria, since modern digital devices offer an interaction medium that enables them to engage with their community and with computers. Still, ASR technologies perform poorly on dysarthric speech, particularly for acute dysarthria. Dysarthric ASR faces multiple challenges, including dysarthric phoneme inaccuracy and labeling imperfections. This paper proposes a spatio-temporal dysarthric ASR (DASR) system that uses a Spatial Convolutional Neural Network (SCNN) and a Multi-Head Attention Transformer (MHAT) to visually extract speech features, so that the DASR learns the shapes of phonemes pronounced by dysarthric individuals. This visual feature modeling sidesteps the phoneme-related challenges. The UA-Speech database, which includes speakers with different speech intelligibility levels, is used for evaluation. However, because the proportion of usable speech data to the number of distinct classes in UA-Speech is small, the proposed DASR system leverages transfer learning to generate synthetic visual features. In benchmarking against the other DASR systems examined in this study, the proposed system improved recognition accuracy by 20.72% on the UA-Speech database, with the largest improvements for very-low (25.75%) and low (33.67%) intelligibility speech.
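The spatio-temporal pipeline the abstract describes (a spatial CNN front-end over spectrogram "images" followed by a multi-head-attention transformer over the time axis) can be sketched as below. This is a minimal illustrative reconstruction, not the authors' exact architecture: the class name, layer sizes, number of heads, and the vocabulary size are all assumptions.

```python
# Hypothetical sketch of an SCNN + multi-head-attention word classifier;
# all layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SpectroSCNN_MHAT(nn.Module):
    """CNN front-end over a spectrogram, transformer encoder over time."""
    def __init__(self, n_classes: int, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Spatial CNN: treats the (freq, time) spectrogram as an image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis
        )
        self.proj = nn.Linear(64, d_model)
        # Multi-head attention transformer encoder over the time frames.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.cnn(spec)                  # (batch, 64, 1, time')
        x = x.squeeze(2).transpose(1, 2)    # (batch, time', 64)
        x = self.encoder(self.proj(x))      # (batch, time', d_model)
        return self.head(x.mean(dim=1))     # utterance-level word logits

model = SpectroSCNN_MHAT(n_classes=100)  # vocabulary size is illustrative
logits = model(torch.randn(2, 1, 80, 120))
print(logits.shape)  # torch.Size([2, 100])
```

Collapsing the frequency axis with adaptive pooling is one simple way to turn the CNN's 2-D feature map into a frame sequence for the attention layers; the paper's actual feature-fusion step may differ.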
Pages: 12