SEMI-SUPERVISED TRAINING FOR END-TO-END MODELS VIA WEAK DISTILLATION

Cited by: 0
Authors:
Li, Bo [1]
Sainath, Tara N. [1]
Pang, Ruoming [1]
Wu, Zelin [1]
Affiliation:
[1] Google LLC, Mountain View, CA 94043 USA
Keywords:
semi-supervised training; sequence-to-sequence
DOI: 10.1109/icassp.2019.8682172
CLC Classification Number: O42 [Acoustics]
Subject Classification Codes: 070206; 082403
Abstract
End-to-end (E2E) models are a promising research direction in speech recognition, as a single all-neural E2E system offers a much simpler and more compact solution than a conventional model, which has separate acoustic (AM), pronunciation (PM), and language models (LM). However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because end-to-end optimization requires paired audio-text data and does not take advantage of the additional lexicons and large amounts of text-only data used to train the LMs in conventional models. There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model. In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data. To target the proper-noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data so that only utterances containing proper nouns are used. We show that training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
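The filtering step described in the abstract, keeping only machine-transcribed utterances whose hypotheses contain proper nouns, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NLTK's off-the-shelf POS tagger as a stand-in for whatever tagger the paper used, and the helper names (contains_proper_noun, filter_unsupervised_pairs) and example data are hypothetical.

# Illustrative sketch only (not the paper's code): keep only unsupervised
# (audio, machine-transcript) pairs whose transcript contains a proper noun
# (Penn Treebank tags NNP/NNPS) before adding them to E2E training data.
import nltk

# Tokenizer and tagger models; these downloads are no-ops if already present.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def contains_proper_noun(hypothesis: str) -> bool:
    """Return True if the transcript hypothesis contains at least one proper noun."""
    tokens = nltk.word_tokenize(hypothesis)
    return any(tag in PROPER_NOUN_TAGS for _, tag in nltk.pos_tag(tokens))


def filter_unsupervised_pairs(pairs):
    """Keep only (audio_path, machine_transcript) pairs whose transcript has a
    proper noun; the remaining unsupervised data is discarded."""
    return [(audio, text) for audio, text in pairs if contains_proper_noun(text)]


if __name__ == "__main__":
    # Hypothetical pairs: audio paths with transcripts produced by a
    # conventional (teacher) recognizer on unsupervised speech.
    pairs = [
        ("utt_0001.wav", "Play the latest album by Ariana Grande"),
        ("utt_0002.wav", "Turn up the volume please"),
    ]
    print(filter_unsupervised_pairs(pairs))

In this sketch, only the first pair would typically survive the filter, mirroring the paper's goal of biasing the semi-supervised data toward proper-noun-rich utterances.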
Pages: 2837-2841
Number of pages: 5