Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Cited: 19
Authors
Liu, Bin [1 ,2 ]
Nie, Shuai [1 ]
Liang, Shan [1 ]
Liu, Wenju [1 ]
Yu, Meng [3 ]
Chen, Lianwu [4 ]
Peng, Shouye [5 ]
Li, Changliang [6 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Tencent AI Lab, Bellevue, WA USA
[4] Tencent AI Lab, Shenzhen, Peoples R China
[5] Xueersi Online Sch, Beijing, Peoples R China
[6] Kingsoft AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
end-to-end speech recognition; robust speech recognition; speech enhancement; generative adversarial networks;
DOI
10.21437/Interspeech.2019-1242
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104; 100213
Abstract
Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, which leads to dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front end of ASR. However, enhancement modules can introduce speech distortions and mismatches with the training conditions, which sometimes degrade ASR performance. In this paper, we propose jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, during training we use a compositional scheme consisting of a mask-based enhancement network, an attention-based encoder-decoder network, and a discriminant network. The discriminator is used to distinguish the enhanced features produced by the enhancement network from clean features, which guides the enhancement network to output features closer to the realistic distribution. With joint optimization of the recognition, enhancement, and adversarial losses, the compositional scheme is expected to automatically learn more robust representations for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
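The abstract's joint optimization of recognition, enhancement, and adversarial losses can be sketched as a weighted sum of the three terms. The sketch below is illustrative only: the weights `alpha` and `beta`, the mean-squared enhancement loss, and the non-saturating adversarial term are assumptions, not details taken from the paper.

```python
import numpy as np

def enhancement_loss(enhanced, clean):
    # Mean-squared error between mask-enhanced and clean features
    return float(np.mean((enhanced - clean) ** 2))

def adversarial_loss(disc_score_on_enhanced):
    # Non-saturating GAN-style term for the enhancement network:
    # push the discriminator's score on enhanced features toward "real" (1)
    return float(-np.log(disc_score_on_enhanced + 1e-8))

def joint_loss(asr_loss, enhanced, clean, disc_score, alpha=0.1, beta=0.01):
    # Joint objective: recognition loss plus weighted enhancement
    # and adversarial terms (alpha, beta are hypothetical weights)
    return asr_loss + alpha * enhancement_loss(enhanced, clean) \
                    + beta * adversarial_loss(disc_score)

# Toy example: a noisy feature frame, a ratio mask, and the enhanced frame
noisy = np.array([1.0, 2.0, 3.0])
mask = np.array([0.9, 0.8, 1.0])   # ratio mask in [0, 1]
clean = np.array([0.9, 1.5, 3.0])
enhanced = mask * noisy            # mask-based enhancement
loss = joint_loss(asr_loss=2.5, enhanced=enhanced, clean=clean, disc_score=0.7)
```

In this toy setup the enhancement and adversarial terms nudge the total loss slightly above the raw recognition loss; in the paper's scheme, all three networks would be updated by backpropagating through such a combined objective.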
Pages: 491-495 (5 pages)
Related Papers
50 records
  • [41] Li, Lujun; Wudamu; Kuerzinger, Ludwig; Watzel, Tobias; Rigoll, Gerhard. Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions. Applied Sciences-Basel, 2021, 11 (16).
  • [42] Liu, Da-Rong; Yang, Chi-Yu; Wu, Szu-Lin; Lee, Hung-Yi. Improving Unsupervised Style Transfer in End-to-End Speech Synthesis with End-to-End Speech Recognition. 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), 2018: 640-647.
  • [43] Rumberg, Lars; Ehlert, Hanna; Luedtke, Ulrike; Ostermann, Joern. Age-Invariant Training for End-to-End Child Speech Recognition Using Adversarial Multi-Task Learning. Interspeech 2021, 2021: 3850-3854.
  • [44] Shi, Hao; Wang, Longbiao; Li, Sheng; Fang, Cunhang; Dang, Jianwu; Kawahara, Tatsuya. Spectrograms Fusion-Based End-to-End Robust Automatic Speech Recognition. 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2021: 438-442.
  • [45] Hu, Yuchen; Hou, Nana; Chen, Chen; Chng, Eng Siong. Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6292-6296.
  • [46] Wang, Qing; Guo, Pengcheng; Sun, Sining; Xie, Lei; Hansen, John H. L. Adversarial Regularization for End-to-End Robust Speaker Verification. Interspeech 2019, 2019: 4010-4014.
  • [47] Chang, Xuankai; Maekaku, Takashi; Fujita, Yuya; Watanabe, Shinji. End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation. Interspeech 2022, 2022: 3819-3823.
  • [48] Sadeq, Nafis; Chowdhury, Nafis Tahmid; Utshaw, Farhan Tanvir; Ahmed, Shafayat; Adnan, Muhammad Abdullah. Improving End-to-End Bangla Speech Recognition with Semi-Supervised Training. Findings of the Association for Computational Linguistics, EMNLP 2020, 2020: 1875-1883.
  • [49] Kim, Hanbyul; Seo, Seunghyun; Lee, Lukas; Baek, Seolki. Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation. Interspeech 2023, 2023: 1653-1657.
  • [50] Shinohara, Yusuke; Watanabe, Shinji. Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition. Interspeech 2022, 2022: 2098-2102.