CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition

Cited by: 1
Authors:
Lin, Yuqin [1 ]
Wang, Longbiao [1 ,2 ]
Yang, Yanbing [1 ]
Dang, Jianwu [1 ]
Affiliations:
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300350, Peoples R China
[2] Huiyan Technol Tianjin Co Ltd, Tianjin 300350, Peoples R China
Funding:
National Natural Science Foundation of China;
Keywords:
Adaptation; automatic speech recognition; dysarthria; auditory cortex; oscillations; progress;
DOI:
10.1109/TASLP.2023.3319276
Chinese Library Classification:
O42 [Acoustics];
Discipline Codes:
070206; 082403;
Abstract:
As an essential technology in human-computer interaction, automatic speech recognition (ASR) brings convenience to healthy people; however, people with speech disorders, who truly need support from such a technology, have difficulty using ASR. Disordered ASR is challenging because of the large variability in disordered speech. Humans tend to process different spectro-temporal features of speech separately in the left and right hemispheres of the brain, and they show significantly better speech perception than machines, especially for disordered speech. Inspired by human speech processing, this article proposes a cognition-inspired feature decomposition and recombination network (CFDRN) for dysarthric ASR. In the CFDRN, slow- and rapid-varying temporal processors are designed to decompose features into stable and changeable features, respectively, and a gated fusion module selectively recombines the decomposed features. Moreover, this study utilised an adaptation approach based on unsupervised pre-training techniques to alleviate data scarcity in dysarthric ASR: CFDRNs were added to the layers of the pre-trained model, and the entire model was adapted from normal speech to disordered speech. The effectiveness of the proposed method was validated on the widely used TORGO and UASpeech dysarthria datasets under three popular unsupervised pre-training techniques: wav2vec 2.0, HuBERT, and data2vec. Compared to the baseline methods, the proposed CFDRN with the three pre-training techniques achieved 13.73%~16.23% and 4.50%~13.20% word error rate reductions on the TORGO and UASpeech datasets, respectively. Furthermore, this study clarified several major factors affecting dysarthric ASR performance.
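As a rough illustration of the decomposition-and-recombination idea described in the abstract, the following minimal PyTorch sketch splits frame-level features into a slow-varying (low-pass, moving-average) stream and a rapid-varying (residual) stream, then recombines them with a learned sigmoid gate. This is an assumption-laden sketch, not the authors' implementation: the class and parameter names (GatedFusionSketch, kernel_size) are hypothetical, and the paper's actual slow/rapid temporal processors and gated fusion module may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionSketch(nn.Module):
    # Hypothetical illustration of feature decomposition and gated
    # recombination; not the authors' CFDRN implementation.
    def __init__(self, dim: int, kernel_size: int = 9):
        super().__init__()
        self.pad = kernel_size // 2          # keep sequence length unchanged
        self.kernel_size = kernel_size
        self.gate = nn.Linear(2 * dim, dim)  # gate computed from both streams

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features, e.g. from one layer
        # of a pre-trained model such as wav2vec 2.0.
        xt = x.transpose(1, 2)               # (batch, dim, time)
        xt = F.pad(xt, (self.pad, self.pad), mode="replicate")
        slow = F.avg_pool1d(xt, self.kernel_size, stride=1).transpose(1, 2)
        rapid = x - slow                     # high-pass residual stream
        g = torch.sigmoid(self.gate(torch.cat([slow, rapid], dim=-1)))
        return g * slow + (1.0 - g) * rapid  # selective recombination

# Usage: y = GatedFusionSketch(dim=768)(torch.randn(2, 100, 768))

In the paper's setting, modules of this kind would be inserted into the layers of the pre-trained model before adapting the whole model from normal to dysarthric speech.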
Pages: 3824-3836
Page count: 13