F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Cited by: 2

Authors:

Zhang, Yixuan ^{[1]}

Wang, Heming ^{[1]}

Wang, Deliang ^{[2,3]}

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Vol. 31

Keywords:

Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST;

DOI:

10.1109/TASLP.2023.3313427

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.


Pages: 3760-3770

Page count: 11

50 items in total

[1] F0 estimation of noisy speech based on complex speech analysis
Kinjo, Tatsuhiko
Funaki, Keiichi
2006 IEEE 12TH DIGITAL SIGNAL PROCESSING WORKSHOP & 4TH IEEE SIGNAL PROCESSING EDUCATION WORKSHOP, VOLS 1 AND 2, 2006, : 434 - 437
[2] ROBUST F0 ESTIMATION IN NOISY SPEECH SIGNALS USING SHIFT AUTOCORRELATION
Kurth, Frank
Cornaggia-Urrigshardt, Alessia
Urrigshardt, Sebastian
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
[3] SAFE: a Statistical Algorithm for F0 Estimation for Both Clean and Noisy Speech
Chu, Wei
Alwan, Abeer
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2598 - 2601
[4] Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors
Hua, Kanru
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 337 - 341
[5] Robust F0 estimation based on complex LPC analysis for IRS filtered noisy speech
Funaki, Keiichi
Kinjo, Tatsuhiko
IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2007, E90A (08) : 1579 - 1586
[6] F0 ESTIMATION FOR NOISY SPEECH BASED ON EXPLORING LOCAL TIME-FREQUENCY SEGMENT
Wang, Dongmei
Hansen, John H. L.
Tobey, Emily
2015 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2015,
[7] Voicing detection in noisy speech signal
Bouzid, Aicha
Ellouze, Noureddine
IMAGE AND SIGNAL PROCESSING, 2008, 5099 : 544 - +
[8] Multiband statistical learning for F0 estimation in speech
Sha, F
Burgoyne, JA
Saul, LK
2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 661 - 664
[9] Single and multiple F0 contour estimation through parametric spectrogram Modeling of speech in noisy environments
Le Roux, Jonathan
Kameoka, Hirokazu
Ono, Nobutaka
de Cheveigne, Alain
Sagayama, Shigeki
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (04) : 1135 - 1145
[10] Harmonic-temporal clustering of speech for single and multiple F0 contour estimation in noisy environments
Le Roux, Jonathan
Kameoka, Hirokazu
Ono, Nobutaka
de Cheveigne, Alain
Sagayama, Shigeki
2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 1053 - +
