A Dynamic Cost Weighting Framework for Unit Selection Text-to-Speech Synthesis

被引:9
|
作者
Bellegarda, Jerome R. [1 ]
机构
[1] Apple Comp Inc, Speech & Language Technol, Cupertino, CA 95014 USA
关键词
Candidate ranking; concatenation-specific cost weighting; concatenative speech synthesis; multiple information streams; unit selection;
D O I
10.1109/TASL.2009.2035209
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. Constraints are normally invoked on diverse characteristics such as inter-unit discontinuity, overall pitch contour, local duration profile, etc., leading to costs often too heterogeneous for a direct quantitative comparison. In order to rank available candidate units, this complexity must be reduced to a single number, and the relative importance of each information stream becomes highly critical. Yet this influence is typically determined in an empirical manner (e. g., based on a limited amount of synthesized data), yielding global weights that are thus applied to broad classes of concatenations indiscriminately. This paper proposes an alternative approach, dynamic cost weighting, based on a data-driven framework separately optimized for each concatenation considered. Specifically, the cost distribution in every stream is dynamically leveraged on a per concatenation basis to locally shift weight towards those characteristics that offer a high discrimination between candidate units, and away from those characteristics that are intrinsically less discriminative. An illustrative case study demonstrates the potential benefits of this solution, and listening evidence suggests that it does indeed entail higher perceived TTS quality.
引用
收藏
页码:1455 / 1463
页数:9
相关论文
共 50 条
  • [1] Syllable specific unit selection cost functions for text-to-speech synthesis
    Narendra, N.P.
    Sreenivasa Rao, K.
    ACM Transactions on Speech and Language Processing, 2012, 9 (03):
  • [2] A global, boundary-centric framework for unit selection text-to-speech synthesis
    Bellegarda, JR
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (03): : 990 - 997
  • [3] Efficient Unit-Selection in Text-to-Speech Synthesis
    Mihelic, Ales
    Gros, Jerneja Zganec
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 411 - 418
  • [4] Embedded Unit Selection Text-to-Speech Synthesis for Mobile Devices
    Karabetsos, Sotiris
    Tsiakoulis, Pirros
    Chalamandaris, Aimilios
    Raptis, Spyros
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2009, 55 (02) : 613 - 621
  • [5] Continuity Metric for Unit Selection based Text-to-Speech Synthesis
    Lakkavalli, Vikram Ramesh
    Arulmozhi, P.
    Ramakrishnan, A. G.
    2010 INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS (SPCOM), 2010,
  • [6] An Overview of the ILSP Unit Selection Text-to-Speech Synthesis System
    Tsiakoulis, Pirros
    Karabetsos, Sotiris
    Chalamandaris, Aimilios
    Raptis, Spyros
    ARTIFICIAL INTELLIGENCE: METHODS AND APPLICATIONS, 2014, 8445 : 370 - 383
  • [7] Globally optimal training of unit boundaries in unit selection text-to-speech synthesis
    Bellegarda, Jerome R.
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (03): : 957 - 965
  • [8] Including Pitch Accent Optionality in Unit Selection Text-to-Speech Synthesis
    Badino, Leonardo
    Clark, Robert A. J.
    Strom, Volker
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2118 - 2121
  • [9] Diphone-based unit selection for Catalan text-to-speech synthesis
    Guaus, R
    Iriondo, I
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2000, 1902 : 277 - 282
  • [10] High quality Arabic text-to-speech synthesis using unit selection
    Abdelmalek, Raja
    Mnasri, Zied
    2016 13TH INTERNATIONAL MULTI-CONFERENCE ON SYSTEMS, SIGNALS & DEVICES (SSD), 2016, : 1 - 5