A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION

Cited by: 20
Authors
Zhu, Qiu-Shi [1]
Zhang, Jie [1, 2]
Zhang, Zi-Qiang [1]
Wu, Ming-Hui [1]
Fang, Xin [1]
Dai, Li-Rong [1]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] Chinese Acad Sci, Inst Acoust, State Key Lab Acoust, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Wav2vec2.0; speech recognition; noise robustness; self-supervised pre-training; speech representation
DOI
10.1109/ICASSP43922.2022.9747379
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations for automatic speech recognition (ASR). Wav2vec2.0 has been shown to be robust to domain shift, but its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 experimentally. We observe that wav2vec2.0 pre-trained on noisy data learns good representations and thus improves ASR performance on the noisy test set, but at the cost of degraded performance on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and the corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results show that the proposed method not only improves ASR performance on the noisy test set beyond that of the original wav2vec2.0, but also incurs only a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
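The idea in the abstract of passing noisy and clean speech through one shared feature encoder, with the clean branch supplying the training targets, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `encode` is a hypothetical stand-in for the feature encoder, `contrastive_loss` is a generic InfoNCE-style objective, and the masking, Transformer context network, and quantization used in wav2vec2.0 are omitted.

```python
import math
import random


def encode(wave, frame=4):
    """Hypothetical stand-in for a shared feature encoder.

    Maps each non-overlapping frame to a 2-D feature: [mean, RMS energy].
    """
    feats = []
    for i in range(0, len(wave) - frame + 1, frame):
        chunk = wave[i:i + frame]
        mean = sum(chunk) / frame
        energy = math.sqrt(sum(x * x for x in chunk) / frame)
        feats.append([mean, energy])
    return feats


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not taken from the paper).

    Frame t of `context` should match frame t of `targets`;
    all other target frames act as distractors.
    """
    total = 0.0
    for t, c in enumerate(context):
        logits = [cosine(c, q) / temperature for q in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[t]
    return total / len(context)


rng = random.Random(0)
clean = [math.sin(0.3 * n) for n in range(64)]       # toy "clean speech"
noisy = [x + rng.gauss(0.0, 0.1) for x in clean]     # same signal plus noise

noisy_feats = encode(noisy)   # noisy branch: model input
clean_feats = encode(clean)   # clean branch: training targets, same encoder
loss = contrastive_loss(noisy_feats, clean_feats)
print(f"{len(noisy_feats)} frames, loss = {loss:.4f}")
```

Because both branches call the same `encode` function, the sketch mirrors the weight sharing described in the abstract; in a trainable model the loss would be minimized so that features of noisy input move toward the targets derived from clean speech.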
Pages: 3174-3178
Number of pages: 5
Related papers
50 in total
  • [31] Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech
    Lin, Jingru
    Ge, Meng
    Wang, Wupeng
    Li, Haizhou
    Feng, Mengling
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 1014 - 1018
  • [32] Noise-Robust speech recognition of Conversational Telephone Speech
    Chen, Gang
    Tolba, Hesham
    O'Shaughnessy, Douglas
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1101 - 1104
  • [33] An engineering model of the masking for the noise-robust speech recognition
    Park, KY
    Lee, SY
    NEUROCOMPUTING, 2003, 52-4 : 615 - 620
  • [34] Self-supervised Representation Fusion for Speech and Wearable Based Emotion Recognition
    Dissanayake, Vipula
    Seneviratne, Sachith
    Suriyaarachchi, Hussel
    Wen, Elliott
    Nanayakkara, Suranga
    INTERSPEECH 2022, 2022, : 3598 - 3602
  • [35] OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition
    Fu, Li
    Li, Siqi
    Li, Qingtao
    Li, Fangzhu
    Deng, Liping
    Fan, Lu
    Chen, Meng
    Wu, Youzheng
    He, Xiaodong
    INTERSPEECH 2023, 2023, : 934 - 938
  • [36] Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
    Rouditchenko, Andrew
    Khurana, Sameer
    Thomas, Samuel
    Feris, Rogerio
    Karlinsky, Leonid
    Kuehne, Hilde
    Harwath, David
    Kingsbury, Brian
    Glass, James
    INTERSPEECH 2023, 2023, : 2268 - 2272
  • [37] Representation Recovering for Self-Supervised Pre-training on Medical Images
    Yan, Xiangyi
    Naushad, Junayed
    Sun, Shanlin
    Han, Kun
    Tang, Hao
    Kong, Deying
    Ma, Haoyu
    You, Chenyu
    Xie, Xiaohui
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2684 - 2694
  • [38] EFFICIENT ADAPTER TRANSFER OF SELF-SUPERVISED SPEECH MODELS FOR AUTOMATIC SPEECH RECOGNITION
    Thomas, Bethan
    Kessler, Samuel
    Karout, Salah
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7102 - 7106
  • [39] Self-Supervised Learning With Segmental Masking for Speech Representation
    Yue, Xianghu
    Lin, Jingru
    Gutierrez, Fabian Ritter
    Li, Haizhou
    IEEE Journal on Selected Topics in Signal Processing, 2022, 16 (06): : 1367 - 1379
  • [40] Self-Supervised Learning With Segmental Masking for Speech Representation
    Yue, Xianghu
    Lin, Jingru
    Gutierrez, Fabian Ritter
    Li, Haizhou
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1367 - 1379