Heterogeneous separation consistency training for adaptation of unsupervised speech separation

Cited by: 0
Authors
Jiangyu Han
Yanhua Long
Affiliations
[1] Shanghai Normal University,Key Innovation Group of Digital Humanities Resource and Research
[2] Shanghai Normal University,Shanghai Engineering Research Center of Intelligent Education and Bigdata
Keywords
Unsupervised speech separation; Heterogeneous; Separation consistency; Cross-knowledge adaptation;
DOI
Not available
Abstract
Supervised speech separation has recently made great progress. However, because supervised training requires ground-truth sources, most existing separation methods are trained on synthetic datasets. This reliance on ground truth is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industrial scenarios the real acoustic characteristics deviate far from those of simulated datasets, so performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, in this study we propose a novel separation consistency training, termed SCT, which exploits real-world unlabeled mixtures to improve cross-domain unsupervised speech separation in an iterative manner, leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT uses two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs so that they produce more reliable, consistent separation results for pseudo-labeling the real mixtures. To maximally exploit the rich complementary information between the different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs with a simple linear fusion can further slightly improve the final system performance. In this paper, we use a cross-dataset setup to simulate the real-life cross-domain situation: the terms "source domain" and "target domain" refer to the simulated set used for model pre-training and the real unlabeled mixtures used for model adaptation, respectively.
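The select-and-fuse step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the agreement score (normalized correlation), the confidence threshold, and the equal fusion weight are all assumptions introduced here for concreteness.

```python
import numpy as np

def agreement(a, b, eps=1e-8):
    # Normalized correlation between two separated sources; a simple
    # stand-in for the paper's confidence measure (not specified here).
    return abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def sct_pseudo_label(out_a, out_b, threshold=0.95, weight=0.5):
    """One pseudo-labeling step (hypothetical names): out_a and out_b are
    per-source waveform estimates from two heterogeneous separators for the
    same unlabeled mixture. If every source pair is mutually consistent,
    return their linear fusion as a pseudo label; otherwise discard the
    mixture by returning None."""
    if min(agreement(a, b) for a, b in zip(out_a, out_b)) < threshold:
        return None  # low confidence: this mixture is not pseudo-labeled
    return [weight * a + (1.0 - weight) * b for a, b in zip(out_a, out_b)]
```

In the full SCT loop, the accepted pseudo labels would be pooled with the simulated training data and both separators retrained, repeating the label/refine cycle.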
The proposed SCT is evaluated on both a public reverberant English and an anechoic Mandarin cross-domain separation task. Results show that, without any ground truth for the target-domain mixtures, SCT still significantly outperforms our two strong baselines, with up to 1.61 dB and 3.44 dB scale-invariant signal-to-noise ratio (SI-SNR) improvements under the English and Mandarin cross-domain conditions, respectively.
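SI-SNR, the metric quoted above, is simple to compute; a common zero-mean formulation (a sketch of the standard definition, not code from the paper) is:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) of an estimate vs. a reference."""
    est = est - est.mean()  # remove DC offsets so the projection is well defined
    ref = ref - ref.mean()
    # project the estimate onto the reference to obtain the "target" component
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target  # residual orthogonal to the reference direction
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps) + eps)
```

Because the estimate is projected onto the reference, rescaling the estimate leaves the score unchanged, which is why SI-SNR is preferred over plain SNR for separation outputs whose gain is arbitrary.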
Related papers
50 items
  • [1] Heterogeneous separation consistency training for adaptation of unsupervised speech separation
    Han, Jiangyu
    Long, Yanhua
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2023, 2023 (01)
  • [2] UNSUPERVISED ADAPTATION WITH DOMAIN SEPARATION NETWORKS FOR ROBUST SPEECH RECOGNITION
    Meng, Zhong
    Chen, Zhuo
    Mazalov, Vadim
    Li, Jinyu
    Gong, Yifan
    2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 214 - 221
  • [3] An Unsupervised Approach to Cochannel Speech Separation
    Hu, Ke
    Wang, DeLiang
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (01): : 120 - 129
  • [4] Heterogeneous Target Speech Separation
    Tzinis, Efthymios
    Wichern, Gordon
    Subramanian, Aswin
    Smaragdis, Paris
    Le Roux, Jonathan
    INTERSPEECH 2022, 2022, : 1796 - 1800
  • [5] MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training
    Karamatli, Ertug
    Kirbiz, Serap
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2637 - 2641
  • [6] Unsupervised single channel speech separation based on optimized subspace separation
    Wiem, Belhedi
    Anouar, Ben Messaoud Mohamed
    Mowlaee, Pejman
    Aicha, Bouzid
    SPEECH COMMUNICATION, 2018, 96 : 93 - 101
  • [7] A study on unsupervised monaural reverberant speech separation
    Hemavathi, R.
    Kumaraswamy, R.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2020, 23 (02) : 451 - 457
  • [8] Unsupervised sequential organization for cochannel speech separation
    Hu, Ke
    Wang, DeLiang
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2794+
  • [9] Extending Interpolation Consistency Training for Unsupervised Domain Adaptation
    Gharib, Shayan
    Klami, Arto
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023