Acoustic speech unit segmentation for concatenative synthesis

Cited by: 4
Authors
Torres, H. M. [1 ]
Gurlekian, J. A. [1 ]
Affiliations
[1] Hosp Clin Buenos Aires, Inst Neurociencias Aplicadas, Consejo Nacl Invest Cient & Tecn, Lab Invest Sensoriales, RA-1120 Buenos Aires, DF, Argentina
Source
COMPUTER SPEECH AND LANGUAGE, 2008, Vol. 22, No. 2
Keywords
Text to speech; Unit segmentation; Corpus-driven; Polyphones
DOI
10.1016/j.csl.2007.07.002
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Synthesis by concatenation of natural speech improves perceptual results when phonemes and syllables are segmented at places where spectral variations are small [Klatt, D., 1987. Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 82 (3), 737-793]. An automatic segmentation method is explored here, using a tool based on a combination of Entropy Coding, Multiresolution Analysis, and Kohonen's Self-Organizing Maps. The segmentation method assumes no boundaries imposed by any linguistic unit. The resulting waveforms represent phone chains dominated by spectral dynamic structures. Each acoustic unit obtained may comprise a variety of phonemes, or a segmented part of a phoneme at the unit boundary. The number of units and the unit structure are speaker dependent; i.e., rate and segmental and suprasegmental distinctive features affect them as the dynamic structure varies. Results obtained from two databases of 741 sentences each, one male and one female, show this dependence, with a different number of units and occurrences for each speaker. Nevertheless, both speakers show a high occurrence of three-phoneme (36-24%) and four-phoneme (29-27%) sequences. Vowel-consonant-vowel sequences are the most frequent type (9.7-8.3%). Consonant-vowel syllables, which are phonemically frequent in Spanish (58%), are less represented (6.6-3.2%) with this method. The relevance of half-phone segmentation is verified by the fact that 66% of the total units for the female speaker, and 53% for the male speaker, start and end with a segmented phone. Perceptual experiments showed that concatenated speech created with dynamic acoustic units was judged more natural than speech created with diphone units. (C) 2007 Elsevier Ltd. All rights reserved.
Pages: 196-206 (11 pages)