Scalable Factorized Hierarchical Variational Autoencoder Training

Cited by: 8
Authors
Hsu, Wei-Ning [1]
Glass, James [1]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, 77 Massachusetts Ave, Cambridge, MA 02139 USA
Keywords
unsupervised learning; speech representation learning; factorized hierarchical variational autoencoder
DOI
10.21437/Interspeech.2018-1034
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations. Among them, a factorized hierarchical variational autoencoder (FHVAE) is a variational inference-based model that formulates a hierarchical generative process for sequential data. Specifically, an FHVAE model can learn disentangled and interpretable representations, which have been proven useful for numerous speech applications, such as speaker verification, robust speech recognition, and voice conversion. However, as we will elaborate in this paper, the training algorithm proposed in the original paper is not scalable to datasets of thousands of hours, which makes this model less applicable on a larger scale. After identifying limitations in terms of runtime, memory, and hyperparameter optimization, we propose a hierarchical sampling training algorithm to address all three issues. Our proposed method is evaluated comprehensively on a wide variety of datasets, ranging from 3 to 1,000 hours and involving different types of generating factors, such as recording conditions and noise types. In addition, we present a new visualization method for qualitatively evaluating performance with respect to interpretability and disentanglement. Models trained with our proposed algorithm demonstrate the desired characteristics on all the datasets.
Pages: 1462-1466
Number of pages: 5