Joint Deep Neural Network for Single-Channel Speech Separation on Masking-Based Training Targets

Cited by: 1
Authors
Chen, Peng [1]
Nguyen, Binh Thien [2]
Geng, Yuting [2]
Iwai, Kenta [2]
Nishiura, Takanobu [2]
Affiliations
[1] Ritsumeikan Univ, Grad Sch Informat Sci & Engn, Osaka 5678570, Japan
[2] Ritsumeikan Univ, Coll Informat Sci & Engn, Osaka, Ibaraki 5678570, Japan
Source
IEEE ACCESS | 2024 / Vol. 12
Funding
Japan Society for the Promotion of Science;
Keywords
Training; Signal to noise ratio; Hidden Markov models; Speech recognition; Speech enhancement; Time-frequency analysis; Distortion measurement; Interference; Fitting; Artificial neural networks; Single-channel speech separation; time-frequency mask; deep neural network; joint network; ideal binary mask; ideal ratio mask; Wiener filter; spectral magnitude mask; SPEAKER RECOGNITION; ENHANCEMENT; NOISE; BINARY;
DOI
10.1109/ACCESS.2024.3479292
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Single-channel speech separation is useful in many applications, and time-frequency (T-F) masking is an effective method for it. With advances in deep learning, T-F masks have come to be used as training targets, achieving notable separation results. Among the numerous masks that have been proposed, the ideal binary mask (IBM), ideal ratio mask (IRM), Wiener filter (WF), and spectral magnitude mask (SMM) are commonly used and have proven effective, though their separation performance varies with the speech mixture and the separation model. Existing approaches mainly use a single network to approximate the mask of the target speech. However, because of pauses and variations in the speakers' intonation and emphasis, mixed speech contains segments where speech overlaps other speech, segments where speech coincides with silent intervals, and segments where the speech mixture has a high signal-to-noise ratio (SNR). In this paper, we attempt to use different networks to handle speech segments containing these different kinds of mixtures. In addition to the existing network, we introduce a network that uses the rectified linear unit (ReLU) as its activation function to specifically address segments containing a mixture of speech and silence, as well as segments with high-SNR speech mixtures. We conducted evaluation experiments on two-speaker speech separation using the four aforementioned masks as training targets. The performance improvements observed in these experiments demonstrate the effectiveness of the proposed joint-network method compared with the conventional single-network method.
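The four training targets named in the abstract can be illustrated with a short sketch. The definitions below follow forms commonly used in the masking literature and are an assumption here; the paper itself may define or scale the masks differently. The function name `tf_masks`, the threshold `lc_db`, and the stabilizer `eps` are hypothetical names chosen for illustration, and the random arrays stand in for real STFT spectrograms.

```python
import numpy as np

def tf_masks(S, N, lc_db=0.0, eps=1e-8):
    """Compute four common T-F masks from complex spectrograms.

    S: complex STFT of the target speaker (freq x time).
    N: complex STFT of the interfering speaker, same shape.
    lc_db: local SNR threshold (dB) for the binary mask (assumed value).
    eps: small constant to avoid division by zero (assumed value).
    """
    S = np.asarray(S)
    N = np.asarray(N)
    Y = S + N                      # mixture spectrogram
    s2 = np.abs(S) ** 2            # target power
    n2 = np.abs(N) ** 2            # interferer power
    snr_db = 10.0 * np.log10((s2 + eps) / (n2 + eps))
    wf = s2 / (s2 + n2 + eps)      # Wiener-filter-style power ratio
    return {
        "IBM": (snr_db > lc_db).astype(float),   # 1 where the target dominates
        "IRM": np.sqrt(wf),                      # square root of the power ratio
        "WF": wf,
        "SMM": np.abs(S) / (np.abs(Y) + eps),    # may exceed 1
    }

# Toy data standing in for STFT frames of two speakers.
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
N = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
masks = tf_masks(S, N)

# Applying the SMM to the mixture magnitude recovers the target magnitude.
est_mag = masks["SMM"] * np.abs(S + N)
```

In this sketch the IBM, IRM, and WF are bounded in [0, 1], while the SMM is unbounded above. That difference is one reason an unbounded output activation such as ReLU can be a natural fit for some targets, although the network design in the paper is not reproduced here.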
Pages: 152036 - 152044
Page count: 9
Related papers
50 in total
  • [21] A new feature set for masking-based monaural speech separation
    Pirhosseinloo, Shadi
    Brumberg, Jonathan S.
    2018 CONFERENCE RECORD OF 52ND ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS, AND COMPUTERS, 2018, : 828 - 832
  • [22] Deep clustering-based single-channel speech separation and recent advances
    Aihara, Ryo
    Wichern, Gordon
    Le Roux, Jonathan
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2020, 41 (02) : 465 - 471
  • [23] ONLINE DEEP ATTRACTOR NETWORK FOR REAL-TIME SINGLE-CHANNEL SPEECH SEPARATION
    Han, Cong
    Luo, Yi
    Mesgarani, Nima
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 361 - 365
  • [24] Deep Clustering in Complex Domain for Single-Channel Speech Separation
    Liu, Runling
    Tang, Yu
    Mang, Hongwei
    2022 IEEE 17TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2022, : 1463 - 1468
  • [25] Features for Masking-Based Monaural Speech Separation in Reverberant Conditions
    Delfarah, Masood
    Wang, DeLiang
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2017, 25 (05) : 1085 - 1094
  • [26] Single-channel speech separation based on modulation frequency
    Gu, Lingyun
    Stern, Richard M.
    2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 25 - 28
  • [27] A Novel Single Channel Speech Enhancement Based on Joint Deep Neural Network and Wiener Filter
    Han, Wei
    Zhang, Xiongwei
    Min, Gang
    Zhou, Xingyu
    PROCEEDINGS OF 2015 IEEE INTERNATIONAL CONFERENCE ON PROGRESS IN INFORMATICS AND COMPUTING (IEEE PIC), 2015, : 163 - 167
  • [28] Joint Optimization of Perceptual Gain Function and Deep Neural Networks for Single-Channel Speech Enhancement
    Han, Wei
    Zhang, Xiongwei
    Min, Gang
    Zhou, Xingyu
    Sun, Meng
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2017, E100A (02) : 714 - 717
  • [29] JOINT OPTIMIZATION OF AUDIBLE NOISE SUPPRESSION AND DEEP NEURAL NETWORKS FOR SINGLE-CHANNEL SPEECH ENHANCEMENT
    Han, Wei
    Zhang, Xiongwei
    Min, Gang
    Sun, Meng
    Yang, Jibin
    2016 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO (ICME), 2016,
  • [30] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
    Tu, Yan-Hui
    Du, Jun
    Lee, Chin-Hui
    JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2018, 90 (07) : 963 - 973