Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Cited by: 19
Authors
Liu, Bin [1 ,2 ]
Nie, Shuai [1 ]
Liang, Shan [1 ]
Liu, Wenju [1 ]
Yu, Meng [3 ]
Chen, Lianwu [4 ]
Peng, Shouye [5 ]
Li, Changliang [6 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Tencent AI Lab, Bellevue, WA USA
[4] Tencent AI Lab, Shenzhen, Peoples R China
[5] Xueersi Online Sch, Beijing, Peoples R China
[6] Kingsoft AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
end-to-end speech recognition; robust speech recognition; speech enhancement; generative adversarial networks;
DOI
10.21437/Interspeech.2019-1242
CLC Classification Number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Code
100104 ; 100213 ;
Abstract
Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, which dramatically degrades performance in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with the ASR training data, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training scheme to boost the robustness of end-to-end systems. Specifically, we jointly compose a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish the enhanced features produced by the enhancement network from clean features, which guides the enhancement network's outputs toward the realistic clean-speech distribution. With joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to automatically learn representations that are more robust for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
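The joint optimization described in the abstract can be sketched as a weighted sum of the three objectives. This is an illustrative reconstruction, not the paper's actual formulation: the function name `joint_loss` and the weights `alpha` and `beta` are assumptions, and the paper does not report specific weight values here.

```python
def joint_loss(loss_asr: float, loss_enh: float, loss_adv: float,
               alpha: float = 0.5, beta: float = 0.1) -> float:
    """Combine the recognition (encoder-decoder), mask-based enhancement,
    and adversarial (discriminator) losses into one training objective.
    alpha and beta are illustrative hyperparameters, not values from the paper."""
    return loss_asr + alpha * loss_enh + beta * loss_adv

# Example with hypothetical per-batch loss values from the three sub-networks:
total = joint_loss(loss_asr=2.0, loss_enh=0.8, loss_adv=0.4)
print(total)  # 2.0 + 0.5*0.8 + 0.1*0.4 = 2.44
```

In such schemes the weighting typically balances how strongly the enhancement front-end is pulled toward clean-feature realism versus recognition accuracy.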
Pages: 491 - 495
Number of pages: 5
Related Papers
50 records in total
  • [31] Coarse-Grained Attention Fusion with Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
    Zhuang, Xuyi
    Zhang, Lu
    Zhang, Zehua
    Qian, Yukun
    Wang, Mingjiang
    INTERSPEECH 2022, 2022, : 3794 - 3798
  • [32] End-to-end Accented Speech Recognition
    Viglino, Thibault
    Motlicek, Petr
    Cernak, Milos
    INTERSPEECH 2019, 2019, : 2140 - 2144
  • [33] Multichannel End-to-end Speech Recognition
    Ochiai, Tsubasa
    Watanabe, Shinji
    Hori, Takaaki
    Hershey, John R.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [34] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [35] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [36] End-to-end multilingual speech recognition system with language supervision training
    Liu, Danyang
    Xu, Ji
    Zhang, Pengyuan
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2020, E103D (06): 1427 - 1430
  • [38] EXPLORING MODEL UNITS AND TRAINING STRATEGIES FOR END-TO-END SPEECH RECOGNITION
    Huang, Mingkun
    Lu, Yizhou
    Wang, Lan
    Qian, Yanmin
    Yu, Kai
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 524 - 531
  • [39] Large Margin Training for Attention Based End-to-End Speech Recognition
    Wang, Peidong
    Cui, Jia
    Weng, Chao
    Yu, Dong
    INTERSPEECH 2019, 2019, : 246 - 250
  • [40] Towards end-to-end training of automatic speech recognition for nigerian pidgin
    Ajisafe, Daniel
    Adegboro, Oluwabukola
    Oduntan, Esther
    Arulogun, Tayo
    arXiv, 2020,