Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech

Cited by: 4
Authors
Kakegawa, Naoto [1]
Hara, Sunao [1]
Abe, Masanobu [1]
Ijima, Yusuke [2]
Affiliations
[1] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Okayama, Japan
[2] NTT Corp, Tokyo, Japan
Source
Interspeech 2021
Keywords
Text-to-speech; Grapheme-to-phoneme (G2P); Attention mechanism; Transformer; Sequence-to-sequence neural networks
DOI
10.21437/Interspeech.2021-914
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
The biggest obstacle to developing end-to-end Japanese text-to-speech (TTS) systems is estimating phonetic and prosodic information (PPI) from Japanese text, for three reasons: (1) Kanji characters in the Japanese writing system have multiple possible pronunciations, (2) words are not separated by any delimiter, and (3) accent nuclei must be assigned at appropriate positions. In this paper, we propose to solve these problems with neural machine translation (NMT) based on encoder-decoder models, and we compare NMT models built on recurrent neural networks and on the Transformer architecture. The proposed model processes text on a token (character) basis, whereas conventional systems process it on a word basis. To verify the potential of the proposed approach, the NMT models are trained on pairs of sentences and their PPIs, generated by a conventional Japanese TTS system from 5 million sentences. Evaluation experiments were performed using manually annotated PPIs for 5,142 sentences. The results show that the Transformer architecture performs best, with 98.0% accuracy for phonetic information estimation and 95.0% accuracy for PPI estimation. Judging from these results, NMT models are a promising step toward end-to-end Japanese TTS.
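The character-level input handling described in the abstract can be illustrated with a minimal preprocessing sketch. This is not the authors' code: the PPI target string, the special tokens, and the helper names (char_tokenize, build_vocab, encode) are assumptions made up for illustration, and the abstract does not specify the actual PPI notation. The sketch only shows that splitting by character avoids any word segmentation step before the text reaches an encoder-decoder model.

```python
# Character-level preprocessing sketch for a sequence-to-sequence model.
# All names and the PPI notation below are hypothetical, not from the paper.

SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]


def char_tokenize(text):
    """Split text into single-character tokens; no word segmentation needed."""
    return list(text)


def build_vocab(sequences):
    """Assign an integer id to every token, after the special tokens."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for seq in sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, wrapping the sequence in <bos> ... <eos>."""
    unk = vocab["<unk>"]
    return [vocab["<bos>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["<eos>"]]


# Source: raw Japanese text with mixed Kanji and Kana and no word delimiters.
source_text = "今日は晴れです"
# Target: a made-up PPI string (Katakana reading plus placeholder accent and
# phrase-boundary marks). The real annotation scheme is not given in the abstract.
target_ppi = "キョ'ーワ/ハレ'デス"

src_tokens = char_tokenize(source_text)
tgt_tokens = char_tokenize(target_ppi)

src_vocab = build_vocab([src_tokens])
tgt_vocab = build_vocab([tgt_tokens])

src_ids = encode(src_tokens, src_vocab)
tgt_ids = encode(tgt_tokens, tgt_vocab)

print(src_tokens)
print(src_ids)
```

In a real setup the vocabularies would be built from the whole training corpus, and the id sequences would be batched and fed to the recurrent or Transformer model.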
Pages: 126-130
Number of pages: 5