A Dynamic Cost Weighting Framework for Unit Selection Text-to-Speech Synthesis

Cited by: 9
Authors
Bellegarda, Jerome R. [1]
Affiliation
[1] Apple Comp Inc, Speech & Language Technol, Cupertino, CA 95014 USA
Keywords
Candidate ranking; concatenation-specific cost weighting; concatenative speech synthesis; multiple information streams; unit selection
DOI
10.1109/TASL.2009.2035209
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. Constraints are normally invoked on diverse characteristics such as inter-unit discontinuity, overall pitch contour, local duration profile, etc., leading to costs often too heterogeneous for a direct quantitative comparison. In order to rank available candidate units, this complexity must be reduced to a single number, and the relative importance of each information stream becomes highly critical. Yet this influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are then applied indiscriminately to broad classes of concatenations. This paper proposes an alternative approach, dynamic cost weighting, based on a data-driven framework separately optimized for each concatenation considered. Specifically, the cost distribution in every stream is dynamically leveraged on a per-concatenation basis to locally shift weight towards those characteristics that offer high discrimination between candidate units, and away from those characteristics that are intrinsically less discriminative. An illustrative case study demonstrates the potential benefits of this solution, and listening evidence suggests that it does indeed entail higher perceived TTS quality.
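The idea of shifting weight towards the more discriminative streams at a single concatenation can be illustrated with a small sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes, purely for illustration, that a stream's discriminative power is measured by the coefficient of variation of its costs across the candidate units, that costs are non-negative, and that heterogeneous costs are made comparable by dividing by the stream mean. The function names `dynamic_weights` and `rank_candidates` and the stream names are hypothetical.

```python
from statistics import mean, pstdev


def dynamic_weights(stream_costs):
    """Return one weight per stream for a single concatenation.

    stream_costs maps a stream name to a list of non-negative costs,
    one per candidate unit. Assumption (not from the paper): a stream's
    discriminative power is its coefficient of variation across candidates.
    """
    spread = {}
    for name, costs in stream_costs.items():
        m = mean(costs)
        # A stream whose costs barely vary cannot separate the candidates.
        spread[name] = pstdev(costs) / m if m > 0 else 0.0
    total = sum(spread.values())
    if total == 0:
        # No stream discriminates at all: fall back to uniform weights.
        return {name: 1.0 / len(spread) for name in spread}
    return {name: s / total for name, s in spread.items()}


def rank_candidates(stream_costs):
    """Return candidate indices ordered from lowest to highest combined cost."""
    weights = dynamic_weights(stream_costs)
    n = len(next(iter(stream_costs.values())))
    combined = [0.0] * n
    for name, costs in stream_costs.items():
        m = mean(costs)
        scale = m if m > 0 else 1.0  # make heterogeneous costs dimensionless
        for i, c in enumerate(costs):
            combined[i] += weights[name] * c / scale
    return sorted(range(n), key=lambda i: combined[i])


# Hypothetical costs for three candidate units at one concatenation point.
example_costs = {
    "join": [0.50, 0.51, 0.50],      # nearly identical across candidates
    "pitch": [10.0, 40.0, 90.0],     # strongly separates the candidates
    "duration": [2.0, 2.0, 2.0],     # does not separate them at all
}
example_weights = dynamic_weights(example_costs)
example_ranking = rank_candidates(example_costs)
```

In this toy case the pitch stream receives almost all of the weight because it is the only one that meaningfully separates the candidates, whereas a fixed global weighting would apply the same proportions at every concatenation regardless of the local cost distribution.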
Pages: 1455-1463
Page count: 9