ADAPTING SPEECH SEPARATION TO REAL-WORLD MEETINGS USING MIXTURE INVARIANT TRAINING

Cited by: 7
Authors
Sivaraman, Aswin [1 ,2 ]
Wisdom, Scott [1 ]
Erdogan, Hakan [1 ]
Hershey, John R. [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Indiana Univ, Bloomington, IN 47405 USA
Keywords
source separation; unsupervised learning; mixture invariant training; real-world audio processing;
DOI
10.1109/ICASSP43922.2022.9747855
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The recently proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models: it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model to real far-field, overlapping, reverberant, and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech and are evaluated subjectively by human listeners. To evaluate our models objectively, we also devise a synthetic AMI test set. For human evaluations on real recordings, we propose a modification of the standard MUSHRA protocol that handles imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement and the highest PESQ scores and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings.
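As an illustration of the two technical ideas named in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the usual MixIT formulation: a model separates the sum of two mixtures into M estimated sources, and the loss is the best negative SNR over all binary assignments of sources back to the two mixtures. The function names (`snr_db`, `si_snr_db`, `mixit_loss`), the plain (unthresholded) SNR loss, and the exhaustive search are assumptions made for illustration; the separation model itself is omitted.

```python
import itertools
import numpy as np


def snr_db(ref, est, eps=1e-8):
    """Signal-to-noise ratio in dB of an estimate against a reference."""
    noise = ref - est
    return 10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(noise ** 2) + eps))


def si_snr_db(ref, est, eps=1e-8):
    """Scale-invariant SNR in dB: compare est with its projection onto ref."""
    ref = ref - ref.mean()
    est = est - est.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    return snr_db(target, est, eps)


def mixit_loss(mixtures, est_sources):
    """Hypothetical MixIT loss sketch.

    mixtures:    array of shape (2, T), the two reference mixtures.
    est_sources: array of shape (M, T), sources estimated from mixtures.sum(0).
    Returns (loss, assignment), where assignment[m] is the index of the
    mixture that estimated source m is assigned to.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Enumerate every binary mixing matrix whose columns each sum to one.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ est_sources
        loss = -sum(snr_db(mixtures[i], remixed[i]) for i in range(n_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sources = rng.standard_normal((4, 1000))  # stand-in for model outputs
    mixes = np.stack([sources[0] + sources[2], sources[1] + sources[3]])
    print(mixit_loss(mixes, sources))
```

The exhaustive search visits 2^M assignments, which is only practical for small M; no isolated reference sources appear anywhere in the loss, which is the property the abstract highlights.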
Pages: 686-690
Page count: 5