A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement

Cited by: 0

Authors:

Chen, Feilong [1]

Lin, Wenmo [1]

Sun, Chengli [1]

Guo, Qiaosheng [2]

Affiliations:

[1] Nanchang Hangkong Univ, Sch Informat Engn, Nanchang 330063, Peoples R China

[2] Chaoyang Jushengtai Xinfeng Technol Co Ltd, Ganzhou 341001, Peoples R China

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2024 / Vol. 43 / Issue 7

Keywords:

Speech enhancement; 3D speech signal; Diffusion model; Beamforming; Multi-channel

DOI:

10.1007/s00034-024-02652-y

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Speech enhancement in 3D reverberant environments is a challenging and significant problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models have shown efficacy for 3D speech enhancement tasks, but they often introduce distortions or unnatural artifacts in the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network and a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress the noise and reverberation, while the diffusion model leverages its generative capability to restore the missing or distorted speech components from the beamformed output. To the best of our knowledge, this is the first work that applies the diffusion model as a backend refiner to 3D speech enhancement. We investigate the effect of training the diffusion model with either enhanced speech or clean speech, and find that clean speech can better capture the prior knowledge of speech components and improve the speech recovery. We evaluate our proposed system on different datasets and beamformer architectures, and show that it achieves consistent improvements in metrics like WER and NISQA, indicating that the diffusion model has strong generalization ability and can serve as a backend refinement module for 3D speech enhancement, regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and also opens up a new direction for applying generative diffusion models to 3D speech processing tasks, which can be used as a backend to various beamforming enhancement methods.


Pages: 4369-4389

Page count: 21

