F0 Estimation and Voicing Detection With Cascade Architecture in Noisy Speech

Cited by: 2

Authors:

Zhang, Yixuan [1]

Wang, Heming [1]

Wang, Deliang [2,3]

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[3] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2023 / Volume 31

Keywords:

Estimation; Noise measurement; Multitasking; Speech enhancement; Convolution; Training; Speech processing; Complex domain processing; densely-connected convolutional recurrent neural network; multi-task learning; neural cascade architecture; pitch tracking; voicing detection; MULTIPITCH TRACKING; PITCH; ALGORITHM; MASKING; ROBUST;

DOI:

10.1109/TASLP.2023.3313427

Chinese Library Classification number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

As a fundamental problem in speech processing, pitch tracking has been studied for decades. While strong performance has been achieved on clean speech, pitch tracking in noisy speech is still challenging. Severe non-stationary noises not only corrupt the harmonic structure in voiced intervals but also make it difficult to determine the existence of voiced speech. Given the importance of voicing detection for pitch tracking, this study proposes a neural cascade architecture that jointly performs pitch estimation and voicing detection. The cascade architecture optimizes a speech enhancement module and a pitch tracking module, and is trained in a speaker-independent and noise-independent way. It is observed that incorporating the enhancement module improves both pitch estimation and voicing detection accuracy, especially in low signal-to-noise ratio (SNR) conditions. In addition, compared with frameworks that combine corresponding single-task models, the proposed multi-task framework achieves better performance and is more efficient. Experimental results show that the proposed method is robust to different noise conditions and substantially outperforms other competitive pitch tracking methods.

Citation

Pages: 3760-3770

Page count: 11
