Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech

Cited by: 5

Authors:

Lee, Yun Kyung [1]

Park, Jeon Gue [1]

Affiliations:

[1] Elect & Telecommun Res Inst ETRI, Artificial Intelligence Res Lab, Daejeon 34129, South Korea

Source:

APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 06

Keywords:

fluency evaluation; speech recognition; data augmentation; variational autoencoder; speech conversion; NONPARALLEL VOICE CONVERSION; BLIND SEPARATION; RECOGNITION;

DOI:

10.3390/app11062642

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703;

Abstract:

This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by the L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker's spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and the rhythm distributions to those of native speakers. To compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker's English sentence differs from that of the native speaker, we align the phonemic sequences using a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers and compute fluency features more accurately by augmenting the non-native training dataset and training an acoustic model with the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving its linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In addition, in the proficiency evaluation and speech recognition tests, the proposed method improves the proficiency score performance and speech recognition accuracy for all proficiency areas compared to a method employing conventional acoustic models.
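
The phoneme alignment step mentioned in the abstract can be illustrated with a minimal sketch of classic dynamic time warping. This is not the authors' implementation: the function name `dtw_align`, the ARPAbet-style phoneme labels, and the 0/1 local distance are all assumptions made for illustration, and the paper may use a different distance or step pattern.

```python
from typing import List, Tuple


def dtw_align(ref: List[str], hyp: List[str]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two phoneme sequences with textbook DTW (hypothetical helper).

    Returns the accumulated cost and the warping path as (ref_index, hyp_index) pairs.
    """
    n, m = len(ref), len(hyp)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning ref[:i] with hyp[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: 0 for identical phonemes, 1 otherwise.
            d = 0.0 if ref[i - 1] == hyp[j - 1] else 1.0
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance ref only
            (cost[i][j - 1], i, j - 1),          # advance hyp only
        ]
        _, i, j = min(moves, key=lambda t: t[0])
    path.reverse()
    return cost[n][m], path


# Example: a native reference and an L2 utterance with a repeated phoneme.
native = ["HH", "AH", "L", "OW"]
learner = ["HH", "AH", "L", "L", "OW"]
total_cost, alignment = dtw_align(native, learner)
```

Once the two sequences are aligned this way, per-phoneme stress or duration values can be compared pairwise along `alignment` even though the sequences differ in length.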


Pages: 16
