Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition

Cited by: 19
Authors
Liu, Bin [1 ,2 ]
Nie, Shuai [1 ]
Liang, Shan [1 ]
Liu, Wenju [1 ]
Yu, Meng [3 ]
Chen, Lianwu [4 ]
Peng, Shouye [5 ]
Li, Changliang [6 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] Tencent AI Lab, Bellevue, WA USA
[4] Tencent AI Lab, Shenzhen, Peoples R China
[5] Xueersi Online Sch, Beijing, Peoples R China
[6] Kingsoft AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
end-to-end speech recognition; robust speech recognition; speech enhancement; generative adversarial networks;
DOI
10.21437/Interspeech.2019-1242
CLC Classification Number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Code
100104 ; 100213 ;
Abstract
Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, which dramatically degrades performance in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with the ASR training data, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training scheme to boost the robustness of end-to-end systems. Specifically, we jointly compose a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish the enhanced features produced by the enhancement network from clean features, which guides the enhancement network's outputs toward the realistic clean-speech distribution. With joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to automatically learn representations that are more robust for the recognition task. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.
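The joint optimization described in the abstract can be sketched as a weighted sum of the three objectives. This is an illustrative reconstruction, not the paper's actual formulation: the function name `joint_loss` and the weights `alpha` and `beta` are assumptions, and the paper does not report specific weight values here.

```python
def joint_loss(loss_asr: float, loss_enh: float, loss_adv: float,
               alpha: float = 0.5, beta: float = 0.1) -> float:
    """Combine the recognition (encoder-decoder), mask-based enhancement,
    and adversarial (discriminator) losses into one training objective.
    alpha and beta are illustrative hyperparameters, not values from the paper."""
    return loss_asr + alpha * loss_enh + beta * loss_adv

# Example with hypothetical per-batch loss values from the three sub-networks:
total = joint_loss(loss_asr=2.0, loss_enh=0.8, loss_adv=0.4)
print(total)  # 2.0 + 0.5*0.8 + 0.1*0.4 = 2.44
```

In such schemes the weighting typically balances how strongly the enhancement front-end is pulled toward clean-feature realism versus recognition accuracy.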
Pages: 491 - 495
Number of pages: 5
Related Papers
50 records in total
  • [31] Coarse-Grained Attention Fusion with Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
    Zhuang, Xuyi
    Zhang, Lu
    Zhang, Zehua
    Qian, Yukun
    Wang, Mingjiang
    INTERSPEECH 2022, 2022, : 3794 - 3798
  • [32] End-to-end Accented Speech Recognition
    Viglino, Thibault
    Motlicek, Petr
    Cernak, Milos
    INTERSPEECH 2019, 2019, : 2140 - 2144
  • [33] Multichannel End-to-end Speech Recognition
    Ochiai, Tsubasa
    Watanabe, Shinji
    Hori, Takaaki
    Hershey, John R.
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70, 2017, 70
  • [34] END-TO-END AUDIOVISUAL SPEECH RECOGNITION
    Petridis, Stavros
    Stafylakis, Themos
    Ma, Pingchuan
    Cai, Feipeng
    Tzimiropoulos, Georgios
    Pantic, Maja
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6548 - 6552
  • [35] END-TO-END ANCHORED SPEECH RECOGNITION
    Wang, Yiming
    Fan, Xing
    Chen, I-Fan
    Liu, Yuzong
    Chen, Tongfei
    Hoffmeister, Bjorn
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7090 - 7094
  • [36] End-to-end multilingual speech recognition system with language supervision training
    Liu, Danyang
    Xu, Ji
    Zhang, Pengyuan
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2020, E103D (06): 1427 - 1430
  • [38] EXPLORING MODEL UNITS AND TRAINING STRATEGIES FOR END-TO-END SPEECH RECOGNITION
    Huang, Mingkun
    Lu, Yizhou
    Wang, Lan
    Qian, Yanmin
    Yu, Kai
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 524 - 531
  • [39] Large Margin Training for Attention Based End-to-End Speech Recognition
    Wang, Peidong
    Cui, Jia
    Weng, Chao
    Yu, Dong
    INTERSPEECH 2019, 2019, : 246 - 250
  • [40] Towards end-to-end training of automatic speech recognition for nigerian pidgin
    Ajisafe, Daniel
    Adegboro, Oluwabukola
    Oduntan, Esther
    Arulogun, Tayo
    arXiv, 2020,