A DUAL-STAGED CONTEXT AGGREGATION METHOD TOWARDS EFFICIENT END-TO-END SPEECH ENHANCEMENT

Cited: 0
Authors
Zhen, Kai [1 ,2 ]
Lee, Mi Suk [3 ]
Kim, Minje [1 ,2 ]
Affiliations
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, Bloomington, IN 47405 USA
[2] Indiana Univ, Cognit Sci Program, Bloomington, IN 47405 USA
[3] Elect & Telecommun Res Inst, Daejeon, South Korea
Keywords
End-to-end; speech enhancement; context aggregation; residual learning; dilated convolution; recurrent network; noise
DOI
10.1109/icassp40776.2020.9054499
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
In speech enhancement, an end-to-end deep neural network converts a noisy speech signal directly into clean speech in the time domain, without a time-frequency transformation or mask estimation. However, aggregating contextual information from a high-resolution time-domain signal at an affordable model complexity remains challenging. In this paper, we propose a densely connected convolutional and recurrent network (DCCRN), a hybrid architecture that enables dual-staged temporal context aggregation. With its dense connectivity and cross-component identical shortcut, DCCRN consistently outperforms competing convolutional baselines, with an average STOI improvement of 0.23 and PESQ of 1.38 across three SNR levels. The proposed method is computationally efficient, with only 1.38 million parameters. Its generalization to unseen noise types is still decent given its low complexity, although it is weaker than Wave-U-Net, which has 7.25 times more parameters.
Pages: 366-370
Page count: 5
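The abstract describes dual-staged temporal context aggregation: densely connected dilated convolutions gather local context, a recurrent component then captures longer-range dependencies, and an identical shortcut carries the input across both components. The PyTorch sketch below illustrates that general idea only; the layer count, channel width, kernel size, dilation schedule, GRU size, and the DualStageContextBlock name are illustrative assumptions and do not reproduce the exact DCCRN configuration or its 1.38 million-parameter budget.

```python
# Minimal sketch of dual-staged temporal context aggregation.
# All hyperparameters below are illustrative assumptions, not the
# published DCCRN configuration.
import torch
import torch.nn as nn


class DualStageContextBlock(nn.Module):
    """Stage 1: densely connected dilated 1-D convolutions (local context).
    Stage 2: a GRU over the convolutional features (long-range context).
    A cross-component identity shortcut adds the block input to the output."""

    def __init__(self, channels: int = 32, num_conv_layers: int = 4, hidden: int = 32):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(num_conv_layers):
            dilation = 2 ** i  # exponentially growing receptive field
            self.convs.append(
                nn.Sequential(
                    nn.Conv1d(
                        channels * (i + 1), channels, kernel_size=3,
                        dilation=dilation, padding=dilation,
                    ),
                    nn.PReLU(),
                )
            )
        self.gru = nn.GRU(channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        features = [x]
        for conv in self.convs:
            # Dense connectivity: each layer sees all previous feature maps.
            features.append(conv(torch.cat(features, dim=1)))
        h = features[-1].transpose(1, 2)   # (batch, time, channels)
        h, _ = self.gru(h)                 # recurrent context aggregation
        h = self.proj(h).transpose(1, 2)   # back to (batch, channels, time)
        return x + h                       # identity shortcut across components


if __name__ == "__main__":
    block = DualStageContextBlock()
    dummy = torch.randn(2, 32, 1000)       # dummy feature maps, short for a quick check
    print(block(dummy).shape)              # torch.Size([2, 32, 1000])
```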
Related Papers
50 records in total
  • [21] UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS
    Bijwadia, Shaan
    Chang, Shuo-yiin
    Li, Bo
    Sainath, Tara
    Zhang, Chao
    He, Yanzhang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 310 - 316
  • [22] Do End-to-End Speech Recognition Models Care About Context?
    Borgholt, Lasse
    Havtorn, Jakob D.
    Agic, Zeljko
    Sogaard, Anders
    Maaloe, Lars
    Igel, Christian
    INTERSPEECH 2020, 2020, : 4352 - 4356
  • [23] Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
    Yang, Fengyu
    Yang, Shan
    Wu, Qinghua
    Wang, Yujun
    Xie, Lei
    INTERSPEECH 2020, 2020, : 3436 - 3440
  • [24] Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition
    Liu, Bin
    Nie, Shuai
    Liang, Shan
    Liu, Wenju
    Yu, Meng
    Chen, Lianwu
    Peng, Shouye
    Li, Changliang
    INTERSPEECH 2019, 2019, : 491 - 495
  • [25] Towards a Method for end-to-end SDN App Development
    Stritzke, Christian
    Priesterjahn, Claudia
    Aranda Gutierrez, Pedro A.
    2015 FOURTH EUROPEAN WORKSHOP ON SOFTWARE DEFINED NETWORKS - EWSDN 2015, 2015, : 107 - 108
  • [26] Towards Paralinguistic-Only Speech Representations for End-to-End Speech Emotion Recognition
    Ioannides, Georgios
    Owen, Michael
    Fletcher, Andrew
    Rozgic, Viktor
    Wang, Chao
    INTERSPEECH 2023, 2023, : 1853 - 1857
  • [27] A Flow Aggregation Method Based on End-to-End Delay in SDN
    Kosugiyama, Takuya
    Tanabe, Kazuki
    Nakayama, Hiroki
    Hayashi, Tsunemasa
    Yamaoka, Katsunori
    2017 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2017,
  • [28] Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
    Zhang, Ying
    Pezeshki, Mohammad
    Brakel, Philemon
    Zhang, Saizheng
    Laurent, Cesar
    Bengio, Yoshua
    Courville, Aaron
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 410 - 414
  • [29] Exploring end-to-end framework towards Khasi speech recognition system
    Bronson Syiem
    L. Joyprakash Singh
    International Journal of Speech Technology, 2021, 24 : 419 - 424
  • [30] TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
    Yoon, Ji Won
    Lee, Hyeonseung
    Kim, Hyung Yong
    Cho, Won Ik
    Kim, Nam Soo
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 1626 - 1638