UNSUPERVISED SPEECH ENHANCEMENT WITH SPEECH RECOGNITION EMBEDDING AND DISENTANGLEMENT LOSSES

Cited by: 6
Authors
Viet Anh Trinh [1]
Braun, Sebastian [2]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10017 USA
[2] Microsoft Res, Redmond, WA USA
Keywords
Speech enhancement; unsupervised learning; networks
DOI
10.1109/ICASSP43922.2022.9746973
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Speech enhancement has recently achieved great success with various deep learning methods. However, most conventional speech enhancement systems are trained with supervised methods that pose two significant challenges. First, a majority of training datasets for speech enhancement systems are synthetic. When mixing clean speech and noisy corpora to create the synthetic datasets, domain mismatches occur between synthetic and real-world recordings of noisy speech or audio. Second, there is a trade-off between increasing speech enhancement performance and degrading automatic speech recognition (ASR) performance. Thus, we propose an unsupervised loss function to tackle these two problems. Our function is developed by extending the MixIT loss function with speech recognition embedding and disentanglement loss. Our results show that the proposed function effectively improves the speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset. While fully unsupervised training is unable to exceed the corresponding baseline, with joint supervised and unsupervised training, the system is able to achieve similar speech quality and better ASR performance than the best supervised baseline.
Pages: 391-395
Number of pages: 5