UNSUPERVISED PROSODIC PHRASE BOUNDARY LABELING OF MANDARIN SPEECH SYNTHESIS DATABASE USING CONTEXT-DEPENDENT HMM

Times cited: 0
Authors
Yang, Chen-Yu [1 ]
Ling, Zhen-Hua [1 ]
Dai, Li-Rong [1 ]
Affiliation
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230026, Peoples R China
Keywords
speech synthesis; phrase boundary; unsupervised labeling; context-dependent hidden Markov model; Viterbi decoding
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
In this paper, an automatic, unsupervised method based on context-dependent hidden Markov models (CD-HMMs) is proposed for labeling the phrase boundary positions of a Mandarin speech synthesis database. The initial phrase boundary labels are predicted by unsupervised clustering of the durations of the pauses between adjacent prosodic words. Then, using these initial labels, CD-HMMs for the spectrum, F0, and phone duration are estimated in a manner similar to HMM-based parametric speech synthesis. The labels are then updated by Viterbi decoding under the maximum likelihood criterion, given the acoustic feature sequences and the trained CD-HMMs. Model training and Viterbi decoding are repeated iteratively until convergence. Experimental results on a Mandarin speech synthesis database show that this method labels the phrase boundary positions much more accurately than the text-analysis-based method, without requiring any manually labeled training data. A unit selection speech synthesis system built with the phrase boundary labels generated by the proposed method achieves performance similar to that of a system built with manual labels.
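The initialization step described in the abstract (clustering pause durations to obtain first-pass boundary labels) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so the two-cluster 1-D k-means used here is an assumption, as are the function names `two_means_1d` and `initial_boundary_labels` and the toy pause durations in seconds. The subsequent CD-HMM training and Viterbi relabeling loop is not shown.

```python
from typing import List, Tuple


def two_means_1d(values: List[float], max_iter: int = 100) -> Tuple[float, float]:
    """Return (short_center, long_center) from 1-D k-means with k=2.

    Assumption: two clusters (non-boundary vs. phrase boundary); the
    abstract only says the pause durations are clustered.
    """
    lo, hi = min(values), max(values)  # initialize centers at the extremes
    for _ in range(max_iter):
        threshold = (lo + hi) / 2.0
        short = [v for v in values if v <= threshold]
        long_ = [v for v in values if v > threshold]
        if not short or not long_:
            break  # degenerate case: all pauses fall in one cluster
        new_lo = sum(short) / len(short)
        new_hi = sum(long_) / len(long_)
        if new_lo == lo and new_hi == hi:
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi


def initial_boundary_labels(pauses: List[float]) -> List[int]:
    """Label each inter-word pause: 1 = phrase boundary, 0 = no boundary."""
    lo, hi = two_means_1d(pauses)
    threshold = (lo + hi) / 2.0
    return [1 if p > threshold else 0 for p in pauses]


# Hypothetical pause durations (seconds) between consecutive prosodic words.
pauses = [0.00, 0.02, 0.01, 0.25, 0.03, 0.30, 0.00, 0.40]
labels = initial_boundary_labels(pauses)
print(labels)
```

In the method summarized above, labels of this kind would only seed the first round of model training; they are then revised by decoding against the trained models.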
Pages: 6875-6879
Page count: 5