A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement

Cited by: 0
Authors
Chen, Feilong [1]
Lin, Wenmo [1]
Sun, Chengli [1]
Guo, Qiaosheng [2]
Affiliations
[1] Nanchang Hangkong Univ, Sch Informat Engn, Nanchang 330063, Peoples R China
[2] Chaoyang Jushengtai Xinfeng Technol Co Ltd, Ganzhou 341001, Peoples R China
Keywords
Speech enhancement; 3D speech signal; Diffusion model; Beamforming; Multi-channel
DOI
10.1007/s00034-024-02652-y
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract
Speech enhancement in 3D reverberant environments is a challenging and important problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models are effective for 3D speech enhancement, but they often introduce distortions or unnatural artifacts into the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network and a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress noise and reverberation, while the diffusion model uses its generative capability to restore missing or distorted speech components in the beamformed output. To the best of our knowledge, this is the first work to apply a diffusion model as a backend refiner for 3D speech enhancement. We investigate the effect of training the diffusion model with either enhanced speech or clean speech, and find that training on clean speech better captures prior knowledge of speech components and improves speech recovery. We evaluate the proposed system on different datasets and beamformer architectures and show that it achieves consistent improvements in metrics such as word error rate (WER) and NISQA. This indicates that the diffusion model generalizes well and can serve as a backend refinement module for 3D speech enhancement regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and opens a new direction for applying generative diffusion models to 3D speech processing tasks.
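The two-stage structure described in the abstract (a spatial-filtering front end followed by an interchangeable backend refiner) can be illustrated with a minimal Python sketch. This is not the authors' implementation: a classical delay-and-sum beamformer stands in for the neural beamforming network, and `identity_refiner` is a hypothetical placeholder where a trained diffusion model would go. The names `delay_and_sum`, `two_stage_enhance`, and `identity_refiner`, and the integer-sample delays, are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> Signal:
    """Stage 1 stand-in: align channels by integer delays, then average.

    A classical delay-and-sum beamformer, used here only as a placeholder
    for the neural beamforming network mentioned in the abstract.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative sample counts")
    # Keep only the samples for which every delayed channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        raise ValueError("delays exceed channel length")
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def identity_refiner(x: Signal) -> Signal:
    """Hypothetical placeholder for the diffusion-based refiner (stage 2)."""
    return list(x)


def two_stage_enhance(channels: Sequence[Sequence[float]],
                      delays: Sequence[int],
                      refiner: Callable[[Signal], Signal]) -> Signal:
    """Run the front-end beamformer, then pass its output to the refiner."""
    beamformed = delay_and_sum(channels, delays)
    return refiner(beamformed)


# Two microphones observe the same source; the second hears it one sample late.
mics = [[1, 2, 3, 4, 0], [0, 1, 2, 3, 4]]
enhanced = two_stage_enhance(mics, delays=[0, 1], refiner=identity_refiner)
print(enhanced)
```

Because the refiner is passed in as a plain callable, the front end can be swapped without touching the backend, which mirrors the abstract's claim that the refiner works regardless of the front-end beamforming network.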
Pages: 4369-4389
Number of pages: 21