Heterogeneous separation consistency training for adaptation of unsupervised speech separation

Cited by: 0
Authors
Jiangyu Han
Yanhua Long
Affiliations
[1] Shanghai Normal University, Key Innovation Group of Digital Humanities Resource and Research
[2] Shanghai Normal University, Shanghai Engineering Research Center of Intelligent Education and Bigdata
Source
EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2023
Keywords
Unsupervised speech separation; Heterogeneous; Separation consistency; Cross-knowledge adaptation
DOI
Not available
Abstract
Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This reliance on ground truth is problematic because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industrial scenarios, the real acoustic characteristics deviate far from those in simulated datasets, so performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, which exploits real-world unlabeled mixtures to improve cross-domain unsupervised speech separation in an iterative manner, leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT uses two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs so that they produce more reliable, consistent separation results for real-mixture pseudo-labeling. To maximally exploit the large amount of complementary information between the different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can further slightly improve the final system performance. In this paper, we use a cross-dataset setup to simulate the real-life cross-domain situation; the terms "source domain" and "target domain" refer to the simulated set used for model pre-training and the real unlabeled mixtures used for model adaptation, respectively. The proposed SCT is evaluated on both a public reverberant English and an anechoic Mandarin cross-domain separation task. Results show that, without any available ground truth for the target-domain mixtures, SCT still significantly outperforms our two strong baselines, with up to 1.61 dB and 3.44 dB scale-invariant signal-to-noise ratio (SI-SNR) improvements on the English and Mandarin cross-domain conditions, respectively.
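For context, the evaluation metric cited in the abstract, scale-invariant signal-to-noise ratio (SI-SNR), is conventionally defined as below; this is the standard formula, not a quotation from the paper:

\[
\mathrm{SI\text{-}SNR}(\hat{s}, s) \;=\; 10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^2}{\lVert e_{\mathrm{noise}} \rVert^2},
\qquad
s_{\mathrm{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s,
\quad
e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}},
\]

where \(\hat{s}\) is the estimated source and \(s\) the reference signal, both taken as zero-mean.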
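The iterative pseudo-labeling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming NumPy and toy model callables; all names here (sct_round, model_a, threshold_db, alpha) are hypothetical, and the paper's actual consistency measure, permutation handling, and training procedure may differ:

```python
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB (standard definition)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return float(10.0 * np.log10((np.dot(target, target) + eps)
                                 / (np.dot(noise, noise) + eps)))

def sct_round(model_a, model_b, real_mixtures, threshold_db=10.0, alpha=0.5):
    """One hypothetical SCT iteration: pseudo-label the real mixtures on
    which the two heterogeneous models agree, here measured by pairwise
    SI-SNR between their estimated sources. A permutation search between
    the two models' outputs would be needed in general; omitted for brevity."""
    pseudo_labeled = []
    for mix in real_mixtures:
        est_a = model_a(mix)  # each model returns a list of source estimates
        est_b = model_b(mix)
        consistency = min(si_snr(a, b) for a, b in zip(est_a, est_b))
        if consistency > threshold_db:
            # Simple linear fusion of the two heterogeneous outputs,
            # echoing the fusion step mentioned in the abstract.
            label = [alpha * a + (1.0 - alpha) * b
                     for a, b in zip(est_a, est_b)]
            pseudo_labeled.append((mix, label))
    return pseudo_labeled
```

In the scheme the abstract describes, such high-confidence pseudo-labeled mixtures would be pooled with the simulated source-domain data to update both HNNs, and the loop repeats; the cross-knowledge adaptation step additionally lets each network learn from the other's outputs.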