LEARNING ACCENT REPRESENTATION WITH MULTI-LEVEL VAE TOWARDS CONTROLLABLE SPEECH SYNTHESIS

Times cited: 2
Authors
Melechovsky, Jan [1]
Mehrish, Ambuj [1]
Herremans, Dorien [1]
Sisman, Berrak [2]
Affiliations
[1] Singapore Univ Technol & Design, Singapore, Singapore
[2] Univ Texas Dallas, Richardson, TX USA
Keywords
Accent; Text-to-Speech; Multi-level Variational Autoencoder; Disentanglement; Controllable speech synthesis; CONVERSION;
DOI
10.1109/SLT54892.2023.10023072
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Accent is a crucial aspect of speech that helps define one's identity. We note that state-of-the-art Text-to-Speech (TTS) systems can generate high-quality voices but still lack versatility and customizability. Moreover, they generally do not take into account accent, which is an important feature of speaking style. In this work, we utilize the concept of Multi-level VAE (ML-VAE) to build a control mechanism that aims to disentangle accent from a reference accented speaker and to synthesize voices in different accents such as English, American, Irish, and Scottish. The proposed framework can also achieve high-quality accented voice generation in a multi-speaker setup, which we believe is remarkable. We investigate the performance through objective metrics and conduct listening experiments for a subjective performance assessment. We show that the proposed method achieves good performance for naturalness, speaker similarity, and accent similarity.
Pages: 928 - 935
Number of pages: 8