Acoustic speech unit segmentation for concatenative synthesis

Cited by: 4
Authors
Torres, H. M. [1]
Gurlekian, J. A. [1]
Affiliation
[1] Hosp Clin Buenos Aires, Inst Neurociencias Aplicadas, Consejo Nacl Invest Cient & Tecn, Lab Invest Sensoriales, RA-1120 Buenos Aires, DF, Argentina
Source
COMPUTER SPEECH AND LANGUAGE | 2008 / Vol. 22 / Issue 02
Keywords
Text to speech; Unit segmentation; Corpus-driven; Polyphones;
DOI
10.1016/j.csl.2007.07.002
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Synthesis by concatenation of natural speech improves perceptual results when phonemes and syllables are segmented at places where spectral variations are small [Klatt, D., 1987. Review of text-to-speech conversion for English. J. Acoust. Soc. Am. 82 (3), 737-793]. An automatic segmentation method is explored here, using a tool based on a combination of Entropy Coding, Multiresolution Analysis, and Kohonen's Self Organized Maps. The segmentation method assumes no limits imposed by any linguistic unit. Resulting waveforms represent phone chains dominated by spectral dynamic structures. Each acoustic unit obtained could be composed of a variety of phonemes or a segmented part of them at the unit boundary. The number of units and unit structure are speaker dependent, i.e. rate, segmental and suprasegmental distinctive features affect them as dynamic structure varies. Results obtained from two databases - one male, one female - of 741 sentences each show this dependence, presenting a different number of units and occurrences for each speaker. Nevertheless, both speakers show a high occurrence of three-phoneme (36-24%) and four-phoneme (29-27%) sequences. Vowel-consonant-vowel sequences are the most frequent type (9.7-8.3%). Consonant-vowel syllables, which are phonemically frequent in Spanish (58%), are less represented (6.6-3.2%) using this method. The relevance of half-phone segmentation is verified given that 66% of the total units for the female speaker and 53% for the male speaker start and end with a segmented phone. Perceptual experiments showed that concatenated speech, created with dynamic acoustic units, was judged more natural than with diphone units. (C) 2007 Elsevier Ltd. All rights reserved.
Pages: 196-206
Number of pages: 11
Related papers
50 in total
  • [31] Unit selection for speech synthesis based on acoustic criteria
    Rouibia, S
    Rosec, O
    Moudenc, T
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2005, 3658 : 281 - 287
  • [32] Spectral dynamics as a source of discontinuity in concatenative speech synthesis
    Kirkpatrick, Barry
    O'Brien, Darragh
    Scaife, Ronan
    Errity, Andrew
    PROCEEDINGS OF THE 2007 15TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING, 2007, : 615 - +
  • [33] Syllable Based Concatenative Synthesis for Text to Speech Conversion
    Ananthi, S.
    Dhanalakshmi, P.
    COMPUTATIONAL INTELLIGENCE IN DATA MINING, VOL 3, 2015, 33
  • [34] Quality Preserving Compression of a Concatenative Text-To-Speech Acoustic Database
    Shoham, Tamar
    Malah, David
    Shechtman, Slava
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (03): : 1056 - 1068
  • [35] Fast concatenative speech synthesis using pre-fused speech units based on the plural unit selection and fusion method
    Tamura, Masatsune
    Mizutani, Tatsuya
    Kagoshima, Takehiko
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2007, E90D (02) : 544 - 553
  • [36] A concatenative speech synthesis for monosyllabic languages with limited data
    Phung, Trung-Nghia
    Luong, Mai Chi
    Akagi, Masato
    2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2012,
  • [37] Selection in a concatenative speech synthesis system using a large speech database
    Hunt, AJ
    Black, AW
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 373 - 376
  • [38] Perceptual and objective detection of discontinuities in concatenative speech synthesis
    Stylianou, Y
    Syrdal, AK
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS: VOL I: SPEECH PROCESSING 1; VOL II: SPEECH PROCESSING 2 IND TECHNOL TRACK DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS NEURALNETWORKS FOR SIGNAL PROCESSING; VOL III: IMAGE & MULTIDIMENSIONAL SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING - VOL IV: SIGNAL PROCESSING FOR COMMUNICATIONS; VOL V: SIGNAL PROCESSING EDUCATION SENSOR ARRAY & MULTICHANNEL SIGNAL PROCESSING AUDIO & ELECTROACOUSTICS; VOL VI: SIGNAL PROCESSING THEORY & METHODS STUDENT FORUM, 2001, : 837 - 840
  • [39] Six Approaches to Limited Domain Concatenative Speech Synthesis
    Utama, Robert J.
    Syrdal, Ann K.
    Conkie, Alistair
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 2058 - +
  • [40] Challenges and rewards in using parametric or concatenative speech synthesis
    Henton C.
    International Journal of Speech Technology, 2002, 5 (02) : 117 - 131