F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Cited by: 2
Authors
Zhang, Yixuan [1]
Wang, Heming [1]
Wang, Deliang [2, 3]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST
DOI
10.1109/TASLP.2023.3313427
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.
Pages: 3760-3770
Page count: 11
Related papers
50 in total
  • [1] F0 estimation of noisy speech based on complex speech analysis
    Kinjo, Tatsuhiko
    Funaki, Keiichi
    2006 IEEE 12TH DIGITAL SIGNAL PROCESSING WORKSHOP & 4TH IEEE SIGNAL PROCESSING EDUCATION WORKSHOP, VOLS 1 AND 2, 2006, : 434 - 437
  • [2] ROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION
    Kurth, Frank
    Cornaggia-Urrigshardt, Alessia
    Urrigshardt, Sebastian
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [3] SAFE: a Statistical Algorithm for F0 Estimation for Both Clean and Noisy Speech
    Chu, Wei
    Alwan, Abeer
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2598 - 2601
  • [4] Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors
    Hua, Kanru
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 337 - 341
  • [5] Robust F0 estimation based on complex LPC analysis for IRS filtered noisy speech
    Funaki, Keiichi
    Kinjo, Tatsuhiko
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2007, E90A (08) : 1579 - 1586
  • [6] F0 ESTIMATION FOR NOISY SPEECH BASED ON EXPLORING LOCAL TIME-FREQUENCY SEGMENT
    Wang, Dongmei
    Hansen, John H. L.
    Tobey, Emily
    2015 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2015,
  • [7] Voicing detection in noisy speech signal
    Bouzid, Aicha
    Ellouze, Noureddine
    IMAGE AND SIGNAL PROCESSING, 2008, 5099 : 544 - +
  • [8] Multiband statistical learning for F0 estimation in speech
    Sha, F
    Burgoyne, JA
    Saul, LK
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 661 - 664
  • [9] Single and multiple F0 contour estimation through parametric spectrogram Modeling of speech in noisy environments
    Le Roux, Jonathan
    Kameoka, Hirokazu
    Ono, Nobutaka
    de Cheveigne, Alain
    Sagayama, Shigeki
    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (04): : 1135 - 1145
  • [10] Harmonic-temporal clustering of speech for single and multiple F0 contour estimation in noisy environments
    Le Roux, Jonathan
    Kameoka, Hirokazu
    Ono, Nobutaka
    de Cheveigne, Alain
    Sagayama, Shigeki
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 1053 - +