Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

被引：41

作者：

Shimada, Kazuki ^{[1
]}

Bando, Yoshiaki ^{[1
]}

Mimura, Masato ^{[1
]}

Itoyama, Katsutoshi ^{[1
]}

Yoshii, Kazuyoshi ^{[1
,2
]}

Kawahara, Tatsuya ^{[1
]}

机构：

[1] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan

[2] RIKEN, Ctr Adv Intelligence Project, Tokyo 1030027, Japan

来源：

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2019年 / 27卷 / 05期

关键词：

Noisy speech recognition; speech enhancement; multichannel nonnegative matrix factorization; beamforming; CONVOLUTIVE MIXTURES; NEURAL-NETWORKS; SEPARATION; SINGLE; MODEL;

D O I：

10.1109/TASLP.2019.2907015

中图分类号：

O42 [声学];

学科分类号：

070206 ; 082403 ;

摘要：

This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, the minimum variance distortionless response (MVDR) beamforming has widely been used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimating such spatial information, conventional studies take a supervised approach that classifies each time-frequency (TF) bin into noise or speech by training a deep neural network (DNN). The performance of ASR, however, is degraded in an unknown noisy environment. To solve this problem, we take an unsupervised approach that decomposes each TF bin into the sum of speech and noise by using multichannel nonnegative matrix factorization (MNMF). This enables us to accurately estimate the SCMs of speech and noise not from observed noisy mixtures but from separated speech and noise components. In this paper, we propose online MVDR beamforming by effectively initializing and incrementally updating the parameters of MNMF. Another main contribution is to comprehensively investigate the performances of ASR obtained by various types of spatial filters, i.e., time-invariant and variant versions of MVDR beamformers and those of rank-1 and full-rank multichannel Wiener filters, in combination with MNMF. The experimental results showed that the proposed method outperformed the state-of-the-art DNN-based beamforming method in unknown environments that did not match training data.

引用

页码：960 / 971

页数：12

共 50 条

[1] Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition
Woo Lee, Geon
Kook Kim, Hong
Kong, Duk-Jo
IEEE ACCESS, 2024, 12 : 72707 - 72720
[2] An overview of noise-robust automatic speech recognition
Li, Jinyu
Deng, Li
Gong, Yifan
Haeb-Umbach, Reinhold
IEEE Transactions on Audio, Speech and Language Processing, 2014, 22 (04): : 745 - 777
[3] An Overview of Noise-Robust Automatic Speech Recognition
Li, Jinyu
Deng, Li
Gong, Yifan
Haeb-Umbach, Reinhold
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2014, 22 (04) : 745 - 777
[4] FLEXIBLE MULTICHANNEL SPEECH ENHANCEMENT FOR NOISE-ROBUST FRONTEND
Jukic, Ante
Balam, Jagadeesh
Ginsburg, Boris
2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA, 2023,
[5] Factorial Speech Processing Models for Noise-Robust Automatic Speech Recognition
Khademian, Mahdi
Homayounpour, Mohammad Mehdi
2015 23RD IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE), 2015, : 637 - 642
[6] Unsupervised modulation filter learning for noise-robust speech recognition
Agrawal, Purvi
Ganapathy, Sriram
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2017, 142 (03): : 1686 - 1692
[7] Noise-Robust Algorithm of Speech Features Extraction for Automatic Speech Recognition System
Yakhnev, A. N.
Pisarev, A. S.
PROCEEDINGS OF THE XIX IEEE INTERNATIONAL CONFERENCE ON SOFT COMPUTING AND MEASUREMENTS (SCM 2016), 2016, : 206 - 208
[8] INCORPORATING MASK MODELLING FOR NOISE-ROBUST AUTOMATIC SPEECH RECOGNITION
Koekueer, Muenevver
Jancovic, Peter
2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 3929 - 3932
[9] Empirical Mode Decomposition For Noise-Robust Automatic Speech Recognition
Wu, Kuo-Hao
Chen, Chia-Ping
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2074 - 2077
[10] A companding front end for noise-robust automatic speech recognition
Guinness, J
Raj, B
Schmidt-Nielsen, B
Turicchia, L
Sarpeshkar, R
2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 249 - 252

← 1 2 3 4 5 →