Heterogeneous separation consistency training for adaptation of unsupervised speech separation

Cited by: 0

Authors:

Jiangyu Han

Yanhua Long

Affiliations:

[1] Shanghai Normal University, Key Innovation Group of Digital Humanities Resource and Research

[2] Shanghai Normal University, Shanghai Engineering Research Center of Intelligent Education and Bigdata

Source:

EURASIP Journal on Audio, Speech, and Music Processing, vol. 2023

Keywords:

Unsupervised speech separation; Heterogeneous; Separation consistency; Cross-knowledge adaptation

DOI:

Not available

Abstract:

Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This reliance on ground truth is problematic because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from those in simulated datasets. Therefore, performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, we propose a novel separation consistency training method, termed SCT, that exploits real-world unlabeled mixtures to improve cross-domain unsupervised speech separation in an iterative manner by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT uses two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs so that they produce more reliable, consistent separation results for pseudo-labeling real mixtures. To make maximal use of the large amount of complementary information between the different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can further slightly improve the final system performance. In this paper, we use cross-dataset experiments to simulate real-life cross-domain conditions. The terms “source domain” and “target domain” refer to the simulation set used for model pre-training and the real unlabeled mixtures used for model adaptation, respectively.
The proposed SCT is evaluated on both public reverberant English and anechoic Mandarin cross-domain separation tasks. Results show that, without any ground truth for the target-domain mixtures, SCT still significantly outperforms our two strong baselines, with scale-invariant signal-to-noise ratio (SI-SNR) improvements of up to 1.61 dB and 3.44 dB on the English and Mandarin cross-domain conditions, respectively.
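
Illustrative sketch (not taken from the paper): the abstract names three ingredients that are easy to show in a few lines, namely the SI-SNR metric, a check of how consistently two heterogeneous models separate the same mixture, and a simple linear fusion of their outputs. The NumPy code below is a minimal sketch under stated assumptions. The SI-SNR function follows the standard definition of the metric. The `consistency` function, the two-speaker restriction, and the fusion weight of 0.5 are hypothetical choices made for illustration only; the abstract does not specify the confidence measure or the fusion weights used by SCT.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between two 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def consistency(outputs_a, outputs_b):
    """Hypothetical consistency score for two-speaker outputs of shape (2, T).

    Returns the mean SI-SNR between the two models' outputs under the better
    of the two speaker orderings, plus a flag saying whether model B's
    outputs had to be swapped to match model A's order.
    """
    straight = 0.5 * (si_snr(outputs_a[0], outputs_b[0]) + si_snr(outputs_a[1], outputs_b[1]))
    swapped = 0.5 * (si_snr(outputs_a[0], outputs_b[1]) + si_snr(outputs_a[1], outputs_b[0]))
    if straight >= swapped:
        return straight, False
    return swapped, True


def linear_fusion(outputs_a, outputs_b, swapped, weight=0.5):
    """Weighted average of the two models' outputs after aligning speaker order.

    The weight is an assumed value, not one reported in the abstract.
    """
    aligned_b = outputs_b[::-1] if swapped else outputs_b
    return weight * outputs_a + (1.0 - weight) * aligned_b
```

In a pseudo-labeling loop of the kind the abstract describes, a mixture would be kept only when the consistency score exceeds some threshold, and the retained outputs would then serve as training targets in the next iteration. Both the threshold and the selection rule are assumptions here.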
