Predicting the Quality of Text-To-Speech Systems from a Large-Scale Feature Set

Cited by: 0
Authors
Hinterleitner, Florian [1]
Norrenbrock, Christoph R. [2]
Moeller, Sebastian [1]
Heute, Ulrich [2]
Affiliations
[1] TU Berlin, Qual & Usabil Lab, Berlin, Germany
[2] CAU Kiel, Digital Signal Proc & Syst Theory, Kiel, Germany
Keywords
quality prediction; text-to-speech (TTS); cross-validation
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We extract 1495 speech features from 2 subjectively evaluated text-to-speech (TTS) databases. These features are derived from pitch, loudness, MFCCs, spectral features, formants, and intensity. The speech material is synthesized using up to 15 different TTS systems, some of them with up to 8 different voices. We develop quality predictors for TTS signals following two different approaches to handle the large feature set: a three-step feature selection followed by a stepwise multiple linear regression, and an approach based on support vector machines. The predictors are validated via 3-fold cross-validation (CV) and leave-one-test-out (LOTO) CV. Because of the high number of features, we apply a strict CV method in which the partitioning is carried out before the feature scaling and feature selection steps. For comparison, we also follow a semi-strict approach in which the partitioning effectively takes place after these steps. In the 3-fold CV case we achieve correlations as high as 0.75 for strict CV and 0.89 for semi-strict CV. The more ambitious LOTO CV yields correlations around 0.80 for the male speakers, whereas the results for the female voices show the need for improvement.
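The difference between strict and semi-strict CV described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a simple correlation filter in place of the three-step feature selection, ordinary least squares in place of stepwise regression or support vector machines, and synthetic noise in place of the TTS databases. All function names, the fold count, and the number of selected features are hypothetical choices made for illustration.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Return indices of the k columns of X with the largest |Pearson r| to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    r = (Xc.T @ yc) / denom
    return np.argsort(-np.abs(r))[:k]


def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept (stand-in for the paper's models)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def cross_validated_correlation(X, y, k, n_folds, strict, seed=0):
    """Pearson correlation between out-of-fold predictions and targets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    selected = None
    if not strict:
        # Semi-strict: scale and select on ALL data, then partition.
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
        selected = top_k_by_correlation(X, y, k)
    predictions = np.empty(len(y))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_train, X_test = X[train_idx], X[test_idx]
        if strict:
            # Strict: partition first; scale and select on the training fold only.
            mean = X_train.mean(axis=0)
            std = X_train.std(axis=0) + 1e-12
            X_train, X_test = (X_train - mean) / std, (X_test - mean) / std
            cols = top_k_by_correlation(X_train, y[train_idx], k)
        else:
            cols = selected
        predictions[test_idx] = fit_predict(
            X_train[:, cols], y[train_idx], X_test[:, cols]
        )
    return float(np.corrcoef(predictions, y)[0, 1])


# Synthetic data: 1495 pure-noise features, so no real predictive signal exists.
rng = np.random.default_rng(42)
X = rng.normal(size=(90, 1495))
y = rng.normal(size=90)

r_strict = cross_validated_correlation(X, y, k=10, n_folds=3, strict=True)
r_semi = cross_validated_correlation(X, y, k=10, n_folds=3, strict=False)
print(f"strict CV correlation:      {r_strict:.2f}")
print(f"semi-strict CV correlation: {r_semi:.2f}")
```

Because the features here are pure noise, the strict estimate should stay near zero, while the semi-strict estimate is inflated by selecting features on data that later serves as test material. This is one plausible reading of why the abstract reports a higher correlation for semi-strict CV than for strict CV.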
Pages: 383-387
Page count: 5