Long short-term memory for speaker generalization in supervised speech separation

Cited by: 186
Authors
Chen, Jitong [1]
Wang, DeLiang [1, 2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Source
Keywords
NEURAL-NETWORKS; ALGORITHM; INTELLIGIBILITY; NOISE; MASKS;
DOI
10.1121/1.4986931
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation and it, without future frames, performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation. © 2017 Acoustical Society of America.
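The masking formulation in the abstract can be illustrated with a toy sketch. The abstract does not say which mask definition is used, so the ratio form below is an assumption, as are the function names, the energy values, and the simplification that speech and noise energies add in each time-frequency unit. In supervised training, a mask computed this way from the premixed signals would serve as the target that a model estimates from noisy-speech features.

```python
# Toy time-frequency masking sketch. The ratio-mask form and all values are
# illustrative assumptions; they are not taken from the paper.

def ratio_mask(speech_energy, noise_energy, eps=1e-12):
    """Per-unit mask in [0, 1]: the fraction of energy that belongs to speech."""
    return [
        [s / (s + n + eps) for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]

def apply_mask(mask, mixture):
    """Scale each time-frequency unit of the mixture by its mask value."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture)
    ]

# Hypothetical energies: rows are time frames, columns are frequency channels.
speech = [[4.0, 0.0], [1.0, 9.0]]
noise = [[0.0, 2.0], [1.0, 1.0]]

# Assumed additive mixing of energies in each unit.
mixture = [[s + n for s, n in zip(s_row, n_row)]
           for s_row, n_row in zip(speech, noise)]

mask = ratio_mask(speech, noise)
estimate = apply_mask(mask, mixture)
```

Under the additive-energy assumption, applying the mask to the mixture recovers the speech energy in every unit, which is why an accurately estimated mask is enough to separate the speech.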
Pages: 4705-4714
Number of pages: 10