Predicting the Quality of Text-To-Speech Systems from a Large-Scale Feature Set

Cited by: 0

Authors:

Hinterleitner, Florian [1]

Norrenbrock, Christoph R. [2]

Moeller, Sebastian [1]

Heute, Ulrich [2]

Affiliations:

[1] TU Berlin, Qual & Usabil Lab, Berlin, Germany

[2] CAU Kiel, Digital Signal Proc & Syst Theory, Kiel, Germany

Source:

14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5 | 2013

Keywords:

quality prediction; text-to-speech (TTS); cross-validation

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We extract 1495 speech features from two subjectively evaluated text-to-speech (TTS) databases. These features are derived from pitch, loudness, MFCCs, spectrals, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the large set of speech features: a three-step feature selection followed by a stepwise multiple linear regression, and an approach based on support vector machines. The predictors are validated via 3-fold cross-validation (CV) and leave-one-test-out (LOTO) CV. Because of the high number of features, we apply a strict CV method in which the partitioning is carried out prior to the feature scaling and feature selection steps. For comparison, we also follow a semi-strict approach in which the partitioning effectively takes place after these steps. In the 3-fold CV case, we achieve correlations as high as 0.75 for strict CV and 0.89 for semi-strict CV. The more ambitious LOTO CV yields correlations around 0.80 for the male speakers, whereas the results for the female voices show the need for improvement.
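The difference between the strict and semi-strict CV variants described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`k_fold_indices`, `fit_scaler`, `strict_folds`, `semi_strict_folds`), the contiguous fold layout, the z-score scaling, and the toy data are all assumptions made for illustration. Only the scaling step is shown; a feature selection step would presumably sit in the same place as the scaler in each variant.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def fit_scaler(values):
    """Return (mean, std) for one feature; std falls back to 1.0 if it is zero."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0
    return mean, std


def scale(values, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return [(v - mean) / std for v in values]


def strict_folds(feature, k):
    """Strict variant: partition first, then fit the scaler on training data only."""
    result = []
    for train, test in k_fold_indices(len(feature), k):
        mean, std = fit_scaler([feature[i] for i in train])
        result.append((
            scale([feature[i] for i in train], mean, std),
            scale([feature[i] for i in test], mean, std),
        ))
    return result


def semi_strict_folds(feature, k):
    """Semi-strict variant: scale on all samples, then partition.

    The test samples influence the scaling statistics here, so information
    from the test partition leaks into the training data.
    """
    mean, std = fit_scaler(feature)
    scaled = scale(feature, mean, std)
    return [
        ([scaled[i] for i in train], [scaled[i] for i in test])
        for train, test in k_fold_indices(len(feature), k)
    ]


if __name__ == "__main__":
    # Hypothetical single feature with one outlier in the last fold.
    feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
    for name, fn in (("strict", strict_folds), ("semi-strict", semi_strict_folds)):
        train_means = [statistics.fmean(train) for train, _ in fn(feature, 3)]
        print(name, [round(m, 3) for m in train_means])
```

In the strict variant every training partition is centred on its own statistics, while in the semi-strict variant the training data already reflect the held-out samples. This kind of leakage is one plausible reason why semi-strict CV reports the higher correlation.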


Pages: 383-387

Page count: 5
