Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32
Authors
Turk, Oytun [1]
Schroeder, Marc [2]
Affiliations
[1] Sensory Inc., Portland, OR 97209, USA
[2] DFKI GmbH, Language Technology Lab, Speech Group, D-66123 Saarbrücken, Germany
Keywords
Expressive speech synthesis; prosody; voice conversion; voice quality transformation
DOI
10.1109/TASL.2010.2041113
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open source and modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show that there is a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by the copy synthesis, contributed to better identification than the approximate models did.
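The cross-combination described in the abstract can be pictured as a small factorial grid. The following is only an illustrative sketch, not the authors' experimental setup: the voice quality methods and target styles are taken from the abstract, while the prosody method labels (`prosody_method_a`, `prosody_method_b`) and the function name `build_conditions` are hypothetical placeholders, since the abstract does not name the individual prosody copy resynthesis methods.

```python
from itertools import product

# Named in the abstract.
VOICE_QUALITY = ["gmm_prediction", "vocal_tract_copy"]
STYLES = ["aggressive", "cheerful", "depressed"]

# Hypothetical placeholders: the abstract only says "various prosody
# copy resynthesis methods" without listing them.
PROSODY = ["prosody_method_a", "prosody_method_b"]


def build_conditions(voice_quality, prosody, styles):
    """Return every (voice quality, prosody, style) combination as a dict."""
    return [
        {"voice_quality": vq, "prosody": p, "style": s}
        for vq, p, s in product(voice_quality, prosody, styles)
    ]


conditions = build_conditions(VOICE_QUALITY, PROSODY, STYLES)
print(len(conditions))  # 2 * 2 * 3 combinations
for condition in conditions[:3]:
    print(condition)
```

Each dictionary in `conditions` would correspond to one stimulus type in a listening test; the actual number of conditions in the paper may differ from this sketch.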
Pages: 965-973
Page count: 9
Related papers
50 in total
  • [21] Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis
    Eng, Nicholas
    Hui, C. T. Justine
    Hioka, Yusuke
    Watson, Catherine I.
    INTERSPEECH 2021, 2021, : 2761 - 2765
  • [22] Expressive speech synthesis: A review
    Govind D.
    Prasanna S.R.M.
    International Journal of Speech Technology, 2013, 16 (2) : 237 - 260
  • [23] Speech Variability Compensation for Expressive Speech Synthesis
    Chen, Yan-You
    Kuan, Ta-Wen
    Tsai, Chun-Yu
    Wang, Jhing-Fa
    Chang, Chia-Hao
    1ST INTERNATIONAL CONFERENCE ON ORANGE TECHNOLOGIES (ICOT 2013), 2013, : 210 - 213
  • [24] Intonation Conversion from Neutral to Expressive Speech
    Veaux, Christophe
    Rodet, Xavier
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2776 - +
  • [25] Emotion Conversion for Expressive Arabic Text to Speech
    Gamal, Doaa
    Rashwan, Mohsen
    Abdou, Sherif Mahdy
    2014 IEEE/ACS 11TH INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2014, : 341 - 348
  • [26] Statistical Voice Conversion Techniques for Body-Conducted Unvoiced Speech Enhancement
    Toda, Tomoki
    Nakagiri, Mikihiro
    Shikano, Kiyohiro
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (09) : 2505 - 2517
  • [27] Voice characteristics conversion for HMM-based speech synthesis system
    Masuko, T
    Tokuda, K
    Kobayashi, T
    Imai, S
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 1611 - 1614
  • [28] Emotional speech synthesis based on improved codebook mapping voice conversion
    Wang, YP
    Ling, ZH
    Wang, RH
    AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION, PROCEEDINGS, 2005, 3784 : 374 - 381
  • [29] High-Individuality Voice Conversion Based on Concatenative Speech Synthesis
    Fujii, Kei
    Okawa, Jun
    Suigetsu, Kaori
    PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY, VOL 26, PARTS 1 AND 2, DECEMBER 2007, 2007, 26 : 483 - 488
  • [30] IMPROVING VOICE QUALITY OF HMM-BASED SPEECH SYNTHESIS USING VOICE CONVERSION METHOD
    Jiao, Yishan
    Xie, Xiang
    Na, Xingyu
    Tu, Ming
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,