ADAPTING SPEECH SEPARATION TO REAL-WORLD MEETINGS USING MIXTURE INVARIANT TRAINING

Cited by: 7
Authors
Sivaraman, Aswin [1 ,2 ]
Wisdom, Scott [1 ]
Erdogan, Hakan [1 ]
Hershey, John R. [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Indiana Univ, Bloomington, IN 47405 USA
Keywords
source separation; unsupervised learning; mixture invariant training; real-world audio processing;
DOI
10.1109/ICASSP43922.2022.9747855
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
The recently proposed mixture invariant training (MixIT) is an unsupervised method for training single-channel sound separation models, in the sense that it does not require ground-truth isolated reference sources. In this paper, we investigate using MixIT to adapt a separation model on real far-field overlapping reverberant and noisy speech data from the AMI Corpus. The models are tested on real AMI recordings containing overlapping speech, and are evaluated subjectively by human listeners. To objectively evaluate our models, we also devise a synthetic AMI test set. For human evaluations on real recordings, we also propose a modification of the standard MUSHRA protocol to handle imperfect reference signals, which we call MUSHIRA. Holding network architectures constant, we find that a fine-tuned semi-supervised model yields the largest SI-SNR improvement and the highest PESQ scores and human listening ratings across synthetic and real datasets, outperforming unadapted generalist models trained on orders of magnitude more data. Our results show that unsupervised learning through MixIT enables model adaptation on real-world unlabeled spontaneous speech recordings.
Pages: 686-690
Page count: 5