Monaural Speech Separation by Means of Convolutive Nonnegative Matrix Partial Co-factorization in Low SNR Condition

Cited by: 0
Authors
Dong X.-L. [1]
Hu Y. [1]
Huang H. [1,2]
Wushour S. [1,2]
Affiliations
[1] Department of Information Science and Engineering, Xinjiang University, Urumqi
[2] Laboratory of Multi-lingual Information Technology, Xinjiang University, Urumqi
Source
Acta Automatica Sinica
Funding
National Natural Science Foundation of China
Keywords
Convolutive nonnegative matrix factorization (CNMF); Monaural speech; Nonnegative matrix partial co-factorization (NMPCF); Speech separation; Strong noise
DOI
10.16383/j.aas.c180065
Abstract
Nonnegative matrix partial co-factorization (NMPCF) is a joint matrix decomposition algorithm that incorporates prior knowledge of a specific source to help separate that source from monaural mixtures. Convolutive nonnegative matrix factorization (CNMF), which introduces a convolutive nonnegative basis set into the NMF process, opens up an interesting avenue of research in monaural sound separation. Building on these two algorithms, we propose a speech separation algorithm for low signal-to-noise ratio (SNR) monaural speech, named convolutive nonnegative matrix partial co-factorization (CNMPCF). First, a voice detection stage based on a fundamental frequency estimation algorithm divides the mixture signal into vocal and nonvocal parts; the vocal parts serve as the test mixture signal, while the nonvocal parts (pure noise) participate in the partial joint decomposition. CNMPCF then yields the separated speech spectrogram, from which the speech signal is reconstructed by the inverse short-time Fourier transform. In the experiments, we mix speech with noise at five SNRs from 0 dB to -12 dB in -3 dB steps to obtain low-SNR mixtures. The results demonstrate that the proposed CNMPCF approach outperforms sparse convolutive nonnegative matrix factorization (SCNMF) and NMPCF under different noise types and noise intensities. Copyright © 2020 Acta Automatica Sinica. All rights reserved.
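The abstract gives no formulas, but the convolutive NMF model at the heart of CNMPCF is standard: a magnitude spectrogram V (frequency x time) is approximated by a sum of time-lagged bases acting on shared activations, V ≈ Σ_t W_t · shift_t(H). Below is a minimal Python sketch of that model with the usual KL-divergence multiplicative updates. It illustrates the underlying technique only, not the authors' CNMPCF implementation (which additionally co-factorizes the noise-only segments); the function names and update details are assumptions.

    import numpy as np

    def shift_right(X, t):
        """Shift columns of X right by t, zero-padding on the left."""
        if t == 0:
            return X
        out = np.zeros_like(X)
        out[:, t:] = X[:, :-t]
        return out

    def shift_left(X, t):
        """Shift columns of X left by t, zero-padding on the right."""
        if t == 0:
            return X
        out = np.zeros_like(X)
        out[:, :-t] = X[:, t:]
        return out

    def cnmf(V, rank, T, n_iter=200, eps=1e-9, seed=0):
        """Convolutive NMF: approximate V (F x N) by
        sum_t W[t] @ shift_right(H, t) via KL multiplicative updates."""
        rng = np.random.default_rng(seed)
        F, N = V.shape
        W = rng.random((T, F, rank)) + eps   # one basis matrix per time lag
        H = rng.random((rank, N)) + eps      # shared activations
        ones = np.ones_like(V)
        for _ in range(n_iter):
            # Current model estimate and KL ratio term
            Lam = sum(W[t] @ shift_right(H, t) for t in range(T)) + eps
            R = V / Lam
            # Update every lag-t basis matrix
            for t in range(T):
                Ht = shift_right(H, t)
                W[t] *= (R @ Ht.T) / (ones @ Ht.T + eps)
            # Recompute the ratio, then update the shared activations
            Lam = sum(W[t] @ shift_right(H, t) for t in range(T)) + eps
            R = V / Lam
            num = sum(W[t].T @ shift_left(R, t) for t in range(T))
            den = sum(W[t].T @ ones for t in range(T)) + eps
            H *= num / den
        return W, H

    # Toy usage on a random nonnegative "spectrogram" (stand-in for |STFT|)
    V = np.abs(np.random.default_rng(1).standard_normal((64, 200)))
    W, H = cnmf(V, rank=8, T=4, n_iter=100)

In the pipeline described by the abstract, V would be the magnitude STFT of a vocal segment; the spectrogram reconstructed from the speech-assigned bases would then be inverted back to a waveform with the inverse short-time Fourier transform.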
Pages: 1200-1209
Number of pages: 9