Predicting the Quality of Text-To-Speech Systems from a Large-Scale Feature Set

Cited by: 0

Authors:

Hinterleitner, Florian [1]

Norrenbrock, Christoph R. [2]

Moeller, Sebastian [1]

Heute, Ulrich [2]

Affiliations:

[1] TU Berlin, Qual & Usabil Lab, Berlin, Germany

[2] CAU Kiel, Digital Signal Proc & Syst Theory, Kiel, Germany

Source:

14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5 | 2013

Keywords:

quality prediction; text-to-speech (TTS); cross-validation

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We extract 1495 speech features from 2 subjectively evaluated text-to-speech (TTS) databases. The features are derived from pitch, loudness, MFCCs, spectral measures, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the large set of speech features: a three-step feature selection followed by a stepwise multiple linear regression, and an approach based on support vector machines. The predictors are validated via 3-fold cross-validation (CV) and leave-one-test-out (LOTO) CV. Because of the high number of features, we apply a strict CV method in which the partitioning is performed prior to the feature scaling and feature selection steps. For comparison, we also follow a semi-strict approach in which the partitioning effectively takes place after these steps. In the 3-fold CV case we achieve correlations as high as 0.75 for strict CV and 0.89 for semi-strict CV. The more ambitious LOTO CV yields correlations around 0.80 for the male speakers, whereas the results for the female voices show the need for improvement.
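
The difference between the strict and semi-strict CV variants described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`k_fold_indices`, `fit_scaler`, `strict_cv`, `semi_strict_cv`) are hypothetical, z-score scaling is assumed as the scaling step, and the feature selection and regression stages are omitted. The only point shown is where the partitioning happens relative to the scaling.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_scaler(rows):
    """Per-feature mean and standard deviation, estimated from `rows` only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features
    return means, stds


def apply_scaler(rows, scaler):
    """Z-score every row with previously estimated means and deviations."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def split(folds, held_out):
    """Return (train indices, test indices) for one held-out fold."""
    train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    return train, folds[held_out]


def strict_cv(X, k, seed=0):
    """Strict variant (assumed reading): partition first, then scale.

    The scaler never sees the held-out fold, so no test information leaks
    into the preprocessing.
    """
    folds = k_fold_indices(len(X), k, seed)
    for held_out in range(k):
        train_idx, test_idx = split(folds, held_out)
        train = [X[i] for i in train_idx]
        test = [X[i] for i in test_idx]
        scaler = fit_scaler(train)
        yield apply_scaler(train, scaler), apply_scaler(test, scaler)


def semi_strict_cv(X, k, seed=0):
    """Semi-strict variant (assumed reading): scale on all data, then partition."""
    X_scaled = apply_scaler(X, fit_scaler(X))
    folds = k_fold_indices(len(X), k, seed)
    for held_out in range(k):
        train_idx, test_idx = split(folds, held_out)
        yield [X_scaled[i] for i in train_idx], [X_scaled[i] for i in test_idx]


if __name__ == "__main__":
    # Toy feature matrix: 12 samples, 2 features (purely illustrative values).
    X = [[float(i), float(i * i % 7)] for i in range(12)]
    for train, test in strict_cv(X, 3):
        print(len(train), len(test))
```

In the strict variant, any feature selection step would likewise be fitted inside the loop on the training fold only. With many features and few samples, fitting such steps on the full data set tends to give optimistic estimates, which is consistent with the gap between the strict and semi-strict correlations reported in the abstract.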


Pages: 383-387

Number of pages: 5

