F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Cited by: 2
Authors
Zhang, Yixuan [1]
Wang, Heming [1]
Wang, Deliang [2, 3]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Keywords
Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST;
DOI
10.1109/TASLP.2023.3313427
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.
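The multi-task idea in the abstract (one model scored on both voicing detection and pitch estimation) can be illustrated with a joint per-frame loss. The following is only a minimal sketch under stated assumptions: the abstract does not give the paper's loss functions, so the binary cross-entropy voicing term, the semitone-error pitch term, the `weight` parameter, and all function names are hypothetical choices made for illustration. The speech enhancement module is not modeled here.

```python
import math

def voicing_loss(p_voiced, is_voiced):
    """Binary cross-entropy for one frame (assumed voicing objective)."""
    eps = 1e-12
    p = min(max(p_voiced, eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(p) if is_voiced else -math.log(1.0 - p)

def pitch_loss(f0_est, f0_ref):
    """Absolute F0 error in semitones (assumed pitch objective)."""
    return abs(12.0 * math.log2(f0_est / f0_ref))

def joint_loss(frames, weight=1.0):
    """Average multi-task loss over frames.

    Each frame is (p_voiced, f0_est_hz, is_voiced, f0_ref_hz).
    The pitch term is counted only on frames whose reference is voiced,
    since F0 is undefined for unvoiced frames.
    """
    total = 0.0
    for p_voiced, f0_est, is_voiced, f0_ref in frames:
        total += voicing_loss(p_voiced, is_voiced)
        if is_voiced:
            total += weight * pitch_loss(f0_est, f0_ref)
    return total / len(frames)
```

In this sketch, a frame with a confident, correct voicing decision and an exact F0 contributes almost nothing to the loss, while an octave error on a voiced frame adds 12 semitones of pitch penalty. Masking the pitch term on unvoiced frames is one common way to keep the two tasks from interfering; whether the paper does this is not stated in the abstract.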
Pages: 3760-3770
Number of pages: 11