Automatic Speech Disfluency Detection Using wav2vec2.0 for Different Languages with Variable Lengths

Cited by: 2
Authors
Liu, Jiajun [1 ,2 ]
Wumaier, Aishan [2 ,3 ]
Wei, Dongping [2 ,3 ]
Guo, Shen [2 ,3 ]
Affiliations
[1] Xinjiang Univ, Coll Software, Urumqi 830046, Peoples R China
[2] Key Lab Multilingual Informat Technol Xinjiang Uyg, Urumqi 830046, Peoples R China
[3] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 13
Keywords
speech disfluency detection; stuttering; limited data; wav2vec2.0; entropy invariance; CLASSIFICATION; DYSFLUENCIES;
DOI
10.3390/app13137579
Chinese Library Classification
O6 [Chemistry];
Discipline Classification Code
0703;
Abstract
Speech is critical for interpersonal communication, but not everyone communicates fluently. Speech disfluency, including stuttering and interruptions, affects not only emotional expression but also clarity of expression for people who stutter. Existing detection methods rely heavily on annotated data, which is costly to obtain, and they do not account for disfluent speech of variable length, which limits their scalability. To address these limitations, this paper proposes an automated speech disfluency detection method that can help individuals improve their communication skills and assist therapists in tracking the progress of stuttering patients. The method detects four types of disfluency in a single-task setting, using embeddings from the pre-trained wav2vec2.0 model together with convolutional neural network (CNN) and Transformer models for feature extraction. To handle variable-length disfluent speech, the attention mechanism is modified according to the entropy invariance principle, which improves the model's scalability across input lengths and languages and thus its practical applicability. Experiments show that the model outperforms baseline models on both English and Chinese datasets, demonstrating its universality and scalability in real-world applications.
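The abstract describes the architecture only at a high level; the sketch below is a minimal, hypothetical PyTorch illustration of the general idea of scaling attention scores with the logarithm of the input length (entropy invariance) on top of wav2vec2.0 frame embeddings and a CNN layer. The module names (EntropyInvariantAttention, DisfluencyHead), the layer sizes, the train_len reference length, and the two-class head are assumptions for illustration, not the authors' implementation.

# Minimal sketch; NOT the paper's code. Layer sizes, train_len, and the
# two-class head are illustrative assumptions.
import math
import torch
import torch.nn as nn

class EntropyInvariantAttention(nn.Module):
    """Dot-product attention whose score scale grows with log(sequence length),
    keeping attention entropy roughly stable across variable-length inputs
    (an assumed reading of the paper's entropy-invariance modification)."""
    def __init__(self, dim: int, n_heads: int = 8, train_len: int = 512):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.train_len = train_len  # hypothetical reference length seen in training
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, frames, head_dim)
        q, k, v = (t.view(b, n, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # log(n)/log(train_len) replaces the usual fixed factor of 1, so longer
        # inputs get sharper scores and attention entropy stays roughly flat
        scale = math.log(max(n, 2)) / math.log(self.train_len) / math.sqrt(self.head_dim)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, d))

class DisfluencyHead(nn.Module):
    """CNN + entropy-invariant attention over wav2vec2.0 frame embeddings,
    mean-pooled into a single-task (one disfluency type) prediction."""
    def __init__(self, feat_dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.attn = EntropyInvariantAttention(feat_dim)
        self.cls = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) hidden states from a wav2vec2.0 encoder
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        x = self.attn(x)
        return self.cls(x.mean(dim=1))  # mean-pool frames, then classify

if __name__ == "__main__":
    feats = torch.randn(2, 300, 768)      # roughly 6 s of 20 ms wav2vec2.0 frames
    print(DisfluencyHead()(feats).shape)  # torch.Size([2, 2])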
Pages: 25
Related Papers (50 in total)
  • [41] Piezoelectric Array Ultrasonic Guided Wave Localization Method for Rail Damage in Rail Transit Based on a Wav2vec2.0 Neural Network
    Liu, Sihao
    Qian, Lubin
    Mei, Yaohua
    Xing, Yuhui
    Urban Rail Transit Research (城市轨道交通研究), 2023, 26 (06) : 101 - 105+110
  • [42] CCC-WAV2VEC 2.0: CLUSTERING AIDED CROSS CONTRASTIVE SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS
    Lodagala, Vasista Sai
    Ghosh, Sreyan
    Umesh, S.
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 1 - 8
  • [43] Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction
    Becerra, Helard
    Ragano, Alessandro
    Hines, Andrew
    INTERSPEECH 2022, 2022, : 4088 - 4092
  • [44] Classification of Vocal Intensity Category from Speech using the Wav2vec2 and Whisper Embeddings
    Kodali, Manila
    Kadiri, Sudarsana Reddy
    Alku, Paavo
    INTERSPEECH 2023, 2023, : 4134 - 4138
  • [45] Automatic Detection of Disfluency Boundaries in Spontaneous Speech of Children Using Audio-Visual Information
    Yildirim, Serdar
    Narayanan, Shrikanth
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2009, 17 (01) : 2 - 12
  • [46] Exploring the Impact of Fine-Tuning the Wav2vec2 Model in Database-Independent Detection of Dysarthric Speech
    Javanmardi, Farhad
    Kadiri, Sudarsana Reddy
    Alku, Paavo
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2024, 28 (08) : 4951 - 4962
  • [47] Automatic Classification of Parkinson's Disease Using Wav2vec Embeddings at Phoneme, Syllable, and Word Levels
    David Gallo-Aristizabal, Jeferson
    Escobar-Grisales, Daniel
    David Rios-Urrego, Cristian
    Noth, Elmar
    Rafael Orozco-Arroyave, Juan
    TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT II, 2024, 15049 : 313 - 323
  • [48] Automatic detection of Parkinson's disease in running speech spoken in three different languages
    Orozco-Arroyave, J. R.
    Hoenig, F.
    Arias-Londono, J. D.
    Vargas-Bonilla, J. F.
    Daqrouq, K.
    Skodda, S.
    Rusz, J.
    Noeth, E.
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2016, 139 (01) : 481 - 500
  • [49] Assessment of Non-Native Speech Intelligibility using Wav2vec2-based Mispronunciation Detection and Multi-level Goodness of Pronunciation Transformer
    Shekar, Ram C. M. C.
    Yang, Mu
    Hirschi, Kevin
    Looney, Stephen
    Kang, Okim
    Hansen, John
    INTERSPEECH 2023, 2023, : 984 - 988
  • [50] Analyzing Wav2Vec 1.0 Embeddings for Cross-Database Parkinson's Disease Detection and Speech Features Extraction
    Klempir, Ondrej
    Krupicka, Radim
    SENSORS, 2024, 24 (17)