A Dynamic Cost Weighting Framework for Unit Selection Text-to-Speech Synthesis

Cited by: 9
Authors
Bellegarda, Jerome R. [1]
Affiliation
[1] Apple Comp Inc, Speech & Language Technol, Cupertino, CA 95014 USA
Keywords
Candidate ranking; concatenation-specific cost weighting; concatenative speech synthesis; multiple information streams; unit selection;
DOI
10.1109/TASL.2009.2035209
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. Constraints are normally invoked on diverse characteristics such as inter-unit discontinuity, overall pitch contour, local duration profile, etc., leading to costs often too heterogeneous for a direct quantitative comparison. In order to rank available candidate units, this complexity must be reduced to a single number, and the relative importance of each information stream becomes highly critical. Yet this influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to broad classes of concatenations indiscriminately. This paper proposes an alternative approach, dynamic cost weighting, based on a data-driven framework separately optimized for each concatenation considered. Specifically, the cost distribution in every stream is dynamically leveraged on a per-concatenation basis to locally shift weight towards those characteristics that offer a high discrimination between candidate units, and away from those characteristics that are intrinsically less discriminative. An illustrative case study demonstrates the potential benefits of this solution, and listening evidence suggests that it does indeed entail higher perceived TTS quality.
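The idea in the abstract, shifting weight per concatenation towards the streams that best separate the candidates, can be illustrated with a minimal sketch. The abstract does not give the paper's actual weighting formula, so everything below is an assumption made for illustration: the function names `dynamic_weights` and `rank_candidates` are hypothetical, each stream is made comparable by dividing by its mean cost, and the spread (standard deviation) of those normalized costs across candidates stands in for "discrimination".

```python
import numpy as np

def dynamic_weights(costs):
    """Illustrative per-concatenation weights (not the paper's formula).

    costs: array-like of shape (n_candidates, n_streams) holding the
    non-negative cost of each candidate unit in each information stream.
    """
    costs = np.asarray(costs, dtype=float)
    # Assumption: divide each stream by its mean so heterogeneous
    # costs (pitch, duration, discontinuity, ...) share a common scale.
    mean = costs.mean(axis=0)
    safe_mean = np.where(mean > 0, mean, 1.0)
    normalized = costs / safe_mean
    # Assumption: spread across candidates is the discrimination proxy.
    spread = normalized.std(axis=0)
    total = spread.sum()
    if total == 0:
        # No stream separates the candidates: fall back to equal weights.
        weights = np.full(costs.shape[1], 1.0 / costs.shape[1])
    else:
        weights = spread / total
    return weights, normalized

def rank_candidates(costs):
    """Return candidate indices sorted by combined cost, plus the weights."""
    weights, normalized = dynamic_weights(costs)
    combined = normalized @ weights  # one number per candidate
    return np.argsort(combined), weights

# Hypothetical costs for three candidates in two streams. Stream 0 barely
# varies across candidates; stream 1 separates them clearly.
order, weights = rank_candidates([[10.0, 0.2], [10.1, 0.8], [9.9, 0.5]])
print(order, weights)
```

In this toy example nearly all the weight lands on the second stream, because the first stream's costs are almost identical across candidates and so say little about which unit to prefer. A fixed global weighting would ignore that and apply the same trade-off at every concatenation.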
Pages: 1455-1463
Page count: 9
Related papers
50 in total
  • [41] An Advanced NLP Framework for High-Quality Text-to-Speech Synthesis
    Ungurean, Catalin
    Burileanu, Dragos
    2011 6TH CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2011,
  • [42] PERCEPTUAL EVALUATION OF DYNAMIC COST WEIGHTING FOR UNIT SELECTION TTS
    Bellegarda, Jerome R.
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4806 - 4809
  • [44] A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept
    Freixes, Marc
    Alias, Francesc
    Claudi Socoro, Joan
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2019, 2019 (01)
  • [45] A hybrid model for text-to-speech synthesis
    Violaro, F
    Boeffard, O
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1998, 6 (05): : 426 - 434
  • [46] Environment Aware Text-to-Speech Synthesis
    Tan, Daxin
    Zhang, Guangyan
    Lee, Tan
    INTERSPEECH 2022, 2022, : 481 - 485
  • [47] Text-to-speech synthesis integrated circuit
    Baskaya, IF
    Aktan, O
    Dündar, G
    PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 653 - 656
  • [48] PHONETIC KNOWLEDGE IN TEXT-TO-SPEECH SYNTHESIS
    van Santen, Jan P. H.
    INTEGRATION OF PHONETIC KNOWLEDGE IN SPEECH TECHNOLOGY, 2005, 25 : 149 - 166
  • [49] Speaker-specific retraining for enhanced compression of unit selection text-to-speech databases
    Nurminen, Jani
    Silen, Hanna
    Gabbouj, Moncef
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 388 - 391
  • [50] RECENT IMPROVEMENTS OF PROBABILITY BASED PROSODY MODELS FOR UNIT SELECTION IN CONCATENATIVE TEXT-TO-SPEECH
    Zhang, Wei
    Gu, Liang
    Gao, Yuqing
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 3777 - 3780