Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Cited by: 19
Authors
Liu, Bin [1 ,2 ]
Nie, Shuai [1 ]
Liang, Shan [1 ]
Liu, Wenju [1 ]
Yu, Meng [3 ]
Chen, Lianwu [4 ]
Peng, Shouye [5 ]
Li, Changliang [6 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Tencent AI Lab, Bellevue, WA USA
[4] Tencent AI Lab, Shenzhen, Peoples R China
[5] Xueersi Online Sch, Beijing, Peoples R China
[6] Kingsoft AI Lab, Beijing, Peoples R China
Source
INTERSPEECH 2019
Funding
National Natural Science Foundation of China
Keywords
end-to-end speech recognition; robust speech recognition; speech enhancement; generative adversarial networks;
DOI
10.21437/Interspeech.2019-1242
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Code
100104; 100213
Abstract
Recently, end-to-end systems have made significant breakthroughs in speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, which leads to dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to place a well-designed speech enhancement module in front of the ASR system. However, enhancement modules can introduce speech distortions and mismatches with the training data, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training scheme to boost the robustness of end-to-end systems. Specifically, during training we use a compositional scheme consisting of a mask-based enhancement network, an attention-based encoder-decoder network, and a discriminant network. The discriminator is used to distinguish the enhanced features produced by the enhancement network from clean features, which guides the enhancement network toward the realistic (clean) distribution. With the joint optimization of the recognition, enhancement, and adversarial losses, the compositional scheme is expected to automatically learn representations that are more robust for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
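The following is a minimal sketch, not the authors' code, of the joint objective described in the abstract: a mask-based enhancement front-end, an attention-based ASR back-end, and a discriminator that separates enhanced features from clean ones, optimized with recognition, enhancement, and adversarial losses. All module architectures, shapes, names, and loss weights (alpha, beta) are illustrative assumptions; the recognizer is a stand-in for the end-to-end encoder-decoder model.

```python
# Hypothetical PyTorch sketch of the jointly adversarial enhancement training objective.
import torch
import torch.nn as nn

class MaskEnhancer(nn.Module):
    """Predicts a [0, 1] time-frequency mask and applies it to noisy features."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, noisy):                  # noisy: (B, T, F)
        h, _ = self.rnn(noisy)
        return self.mask(h) * noisy            # enhanced features

class Discriminator(nn.Module):
    """Scores whether a feature sequence looks clean (1) or enhanced (0)."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):                  # feats: (B, T, F)
        return self.net(feats).mean(dim=1)     # (B, 1) utterance-level logit

def joint_losses(enhancer, recognizer, discriminator,
                 noisy, clean, asr_targets, alpha=1.0, beta=0.1):
    """Combine recognition, enhancement, and adversarial losses.

    `recognizer` is any attention-based encoder-decoder callable returning an
    ASR loss on (features, targets); it is an assumption of this sketch.
    """
    bce = nn.BCEWithLogitsLoss()
    enhanced = enhancer(noisy)
    ones = torch.ones(clean.size(0), 1)
    zeros = torch.zeros(clean.size(0), 1)

    # Discriminator step: label clean features 1, enhanced features 0
    # (generator detached so only the discriminator is updated).
    d_loss = bce(discriminator(clean), ones) + \
             bce(discriminator(enhanced.detach()), zeros)

    # Generator/ASR step: fool the discriminator, match clean features,
    # and minimize the end-to-end recognition loss on enhanced features.
    adv_loss = bce(discriminator(enhanced), ones)
    enh_loss = nn.functional.mse_loss(enhanced, clean)
    asr_loss = recognizer(enhanced, asr_targets)
    g_loss = asr_loss + alpha * enh_loss + beta * adv_loss
    return d_loss, g_loss
```

In a training loop one would alternate the two returned losses, stepping the discriminator on d_loss and the enhancement plus recognition networks on g_loss, analogous to standard GAN training.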
Pages: 491-495
Page count: 5