A NOISE-ROBUST SELF-SUPERVISED PRE-TRAINING MODEL BASED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEECH RECOGNITION

Cited by: 20
Authors
Zhu, Qiu-Shi [1 ]
Zhang, Jie [1 ,2 ]
Zhang, Zi-Qiang [1 ]
Wu, Ming-Hui [1 ]
Fang, Xin [1 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China USTC, NEL SLIP, Hefei, Peoples R China
[2] Chinese Acad Sci, Inst Acoust, State Key Lab Acoust, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Wav2vec2.0; speech recognition; noise robustness; self-supervised pre-training; speech representation;
DOI
10.1109/ICASSP43922.2022.9747379
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). Wav2vec2.0 has been shown to be robust to domain shift, but its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 experimentally. We observe that wav2vec2.0 pre-trained on noisy data learns good representations and thus improves ASR performance on the noisy test set, but at the cost of degraded performance on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and the corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results show that the proposed method not only improves ASR performance on the noisy test set beyond that of the original wav2vec2.0, but also incurs only a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
Pages: 3174 - 3178
Page count: 5
Related papers
50 in total
  • [21] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
  • [22] Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
    Popuri, Sravya
    Chen, Peng-Jen
    Wang, Changhan
    Pino, Juan
    Adi, Yossi
    Gu, Jiatao
    Hsu, Wei-Ning
    Lee, Ann
    INTERSPEECH 2022, 2022, : 5195 - 5199
  • [23] EXPLORING THE INTEGRATION OF SPEECH SEPARATION AND RECOGNITION WITH SELF-SUPERVISED LEARNING REPRESENTATION
    Masuyama, Yoshiki
    Chang, Xuankai
    Zhang, Wangyou
    Cornell, Samuele
    Wang, Zhong-Qiu
    Ono, Nobutaka
    Qian, Yanmin
    Watanabe, Shinji
    2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA, 2023,
  • [24] Noise-Robust Algorithm of Speech Features Extraction for Automatic Speech Recognition System
    Yakhnev, A. N.
    Pisarev, A. S.
    PROCEEDINGS OF THE XIX IEEE INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND MEASUREMENTS (SCM 2016), 2016, : 206 - 208
  • [25] MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION
    Ravanelli, Mirco
    Zhong, Jianyuan
    Pascual, Santiago
    Swietojanski, Pawel
    Monteiro, Joao
    Trmal, Jan
    Bengio, Yoshua
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6989 - 6993
  • [26] End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
    Chang, Xuankai
    Maekaku, Takashi
    Fujita, Yuya
    Watanabe, Shinji
    INTERSPEECH 2022, 2022, : 3819 - 3823
  • [27] INCORPORATING MASK MODELLING FOR NOISE-ROBUST AUTOMATIC SPEECH RECOGNITION
    Koekueer, Muenevver
    Jancovic, Peter
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 3929 - 3932
  • [28] Empirical Mode Decomposition For Noise-Robust Automatic Speech Recognition
    Wu, Kuo-Hao
    Chen, Chia-Ping
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2074 - 2077
  • [29] Masked self-supervised pre-training model for EEG-based emotion recognition
    Hu, Xinrong
    Chen, Yu
    Yan, Jinlin
    Wu, Yuan
    Ding, Lei
    Xu, Jin
    Cheng, Jun
    COMPUTATIONAL INTELLIGENCE, 2024, 40 (03)
  • [30] A companding front end for noise-robust automatic speech recognition
    Guinness, J
    Raj, B
    Schmidt-Nielsen, B
    Turicchia, L
    Sarpeshkar, R
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 249 - 252