CFDRN: A Cognition-Inspired Feature Decomposition and Recombination Network for Dysarthric Speech Recognition

Cited by: 1
Authors
Lin, Yuqin [1]
Wang, Longbiao [1,2]
Yang, Yanbing [1]
Dang, Jianwu [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300350, Peoples R China
[2] Huiyan Technol Tianjin Co Ltd, Tianjin 300350, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Adaptation; automatic speech recognition; dysarthria; auditory cortex; oscillations; progress
DOI
10.1109/TASLP.2023.3319276
CLC Classification Number
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
As an essential technology in human-computer interaction, automatic speech recognition (ASR) makes everyday life more convenient for healthy people; however, people with speech disorders, who most need such support, have had difficulty using ASR. Disordered ASR is challenging because of the large variability in disordered speech. Humans tend to process different spectro-temporal features of speech separately in the left and right hemispheres of the brain and show significantly better speech perception than machines, especially for disordered speech. Inspired by human speech processing, this article proposes a cognition-inspired feature decomposition and recombination network (CFDRN) for dysarthric ASR. In the CFDRN, slow- and rapid-varying temporal processors decompose features into stable and changeable features, respectively, and a gated fusion module selectively recombines the decomposed features. Moreover, this study uses an adaptation approach based on unsupervised pre-training techniques to alleviate the data scarcity of dysarthric ASR: CFDRNs are added to the layers of a pre-trained model, and the entire model is adapted from normal speech to disordered speech. The effectiveness of the proposed method was validated on the widely used TORGO and UASpeech dysarthria datasets under three popular unsupervised pre-training techniques, wav2vec 2.0, HuBERT, and data2vec. Compared with the baseline methods, the proposed CFDRN with the three pre-training techniques achieved 13.73%~16.23% and 4.50%~13.20% word error rate reductions on the TORGO and UASpeech datasets, respectively. Furthermore, this study clarifies several major factors affecting dysarthric ASR performance.
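The abstract only sketches the architecture at a high level. As a rough illustration of the decomposition-plus-gated-fusion idea it describes, the following is a minimal, hypothetical PyTorch sketch: the use of wide and narrow depthwise convolutions as stand-ins for the slow- and rapid-varying temporal processors, the sigmoid gate, the residual connection, and all layer sizes are assumptions made for illustration, not the published CFDRN design.

```python
# Hypothetical sketch of feature decomposition into slow/rapid streams plus gated
# fusion; NOT the paper's implementation. Branch structure and sizes are assumed.
import torch
import torch.nn as nn


class GatedDecompositionFusion(nn.Module):
    def __init__(self, dim: int, slow_kernel: int = 9):
        super().__init__()
        # Slow-varying branch: a wide depthwise convolution smooths over time,
        # standing in for a "stable" feature processor.
        self.slow = nn.Conv1d(dim, dim, kernel_size=slow_kernel,
                              padding=slow_kernel // 2, groups=dim)
        # Rapid-varying branch: a narrow kernel keeps frame-level "changeable" detail.
        self.rapid = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Gate chooses, per frame and channel, how to recombine the two streams.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) hidden states from one pre-trained encoder layer.
        h = x.transpose(1, 2)                       # (batch, dim, time) for Conv1d
        stable = self.slow(h).transpose(1, 2)       # slow-varying / stable stream
        changeable = self.rapid(h).transpose(1, 2)  # rapid-varying / changeable stream
        g = torch.sigmoid(self.gate(torch.cat([stable, changeable], dim=-1)))
        # Residual recombination keeps the original layer output available.
        return x + g * stable + (1.0 - g) * changeable


if __name__ == "__main__":
    layer_out = torch.randn(2, 100, 768)            # e.g. a wav2vec 2.0 layer output
    fused = GatedDecompositionFusion(768)(layer_out)
    print(fused.shape)                              # torch.Size([2, 100, 768])
```

In an adapter-style adaptation, modules of this kind would be inserted after selected encoder layers of wav2vec 2.0, HuBERT, or data2vec and trained on dysarthric speech; the paper's exact placement and training recipe may differ from this sketch.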
Pages: 3824-3836
Number of pages: 13
Related Papers
50 records in total
  • [1] Brain Network Manifold Learned by Cognition-Inspired Graph Embedding Model for Emotion Recognition
    Li, Cunbo
    Li, Peiyang
    Chen, Zhaojin
    Yang, Lei
    Li, Fali
    Wan, Feng
    Cao, Zehong
    Yao, Dezhong
    Lu, Bao-Liang
    Xu, Peng
    IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS, 2024, 54 (12): 7794-7808
  • [2] Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network
    Yakoub, Mohammed Sidi
    Selouani, Sid-ahmed
    Zaidi, Brahim-Fares
    Bouchair, Asma
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2020
  • [3] Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network
    Yakoub, Mohammed
    Selouani, Sid-ahmed
    Zaidi, Brahim-Fares
    Bouchair, Asma
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2020, 2020 (01)
  • [4] Significance of Feature Selection for Acoustic Modeling in Dysarthric Speech Recognition
    Mathew, Jerin Baby
    Jacob, Jonie
    Sajeev, Karun
    Joy, Jithin
    Rajan, Rajeev
    2018 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, SIGNAL PROCESSING AND NETWORKING (WISPNET), 2018
  • [5] Dysarthric Speech Recognition Using a Convolutive Bottleneck Network
    Nakashika, Toru
    Yoshioka, Toshiya
    Takiguchi, Tetsuya
    Ariki, Yasuo
    Duffner, Stefan
    Garcia, Christophe
    2014 12TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP), 2014: 505-509
  • [6] Conceptual text region network: Cognition-inspired accurate scene text detection
    Cui, Chenwei
    Lu, Liangfu
    Tan, Zhiyuan
    Hussain, Amir
    NEUROCOMPUTING, 2021, 464: 252-264
  • [7] Generative Model-Driven Feature Learning for dysarthric speech recognition
    Rajeswari, N.
    Chandrakala, S.
    BIOCYBERNETICS AND BIOMEDICAL ENGINEERING, 2016, 36 (04): 553-561
  • [8] PHASE-BASED FEATURE REPRESENTATIONS FOR IMPROVING RECOGNITION OF DYSARTHRIC SPEECH
    Sehgal, Siddharth
    Cunningham, Stuart
    Green, Phil
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018: 13-20
  • [9] Dysarthric Speech Recognition Using Convolutional LSTM Neural Network
    Kim, Myungjong
    Cao, Beiming
    An, Kwanghoon
    Wang, Jun
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018: 2948-2952
  • [10] Deep neural network architectures for dysarthric speech analysis and recognition
    Zaidi, Brahim Fares
    Selouani, Sid Ahmed
    Boudraa, Malika
    Yakoub, Mohammed Sidi
    NEURAL COMPUTING AND APPLICATIONS, 2021, 33: 9089-9108