Improving speech command recognition through decision-level fusion of deep filtered speech cues

Cited by: 5
Authors
Mehra, Sunakshi [1]
Ranga, Virender [1]
Agarwal, Ritu [1]
Affiliations
[1] Delhi Technol Univ, Dept Informat Technol, Delhi, India
Keywords
Speech filtering techniques; Swin-tiny transformer; Feed-forward neural network (FNN); Speech command recognition; ENHANCEMENT
DOI
10.1007/s11760-023-02845-z
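Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification codes
0808; 0809
Abstract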
Living beings communicate through speech, which can be analysed to identify words and sentences by recognizing the flow of spoken utterances. However, background noise always affects the speech recognition process, and the detection rate in its presence remains unsatisfactory, which calls for further research and potential remedies. To improve noisy speech, this research proposes speech recognition based on a combination of median filtering and adaptive filtering. Speech command recognition is achieved by applying these popular noise reduction techniques and using the two resulting channels of filtered speech independently and in parallel. The procedure involves five steps. First, signals are enhanced using two parallel, independent speech enhancement models (median and adaptive filtering). Second, 2D Mel spectrogram images are extracted from the enhanced signals. Third, the 2D Mel spectrogram images are passed to the tiny Swin Transformer for classification; this classification is performed over the large-scale ImageNet dataset, which consists of 14 million images and is approximately 150 GB in size. Fourth, the posterior probabilities extracted from the tiny Swin Transformer are fed into the proposed 3-layered feed-forward network for classification among the 10 speech command categories. Lastly, decision-level fusion is applied to the outputs of the two parallel, independent channels obtained from the 3-layered feed-forward network. For experimentation, the Google Speech Command dataset version 2 is used. We obtained a test accuracy of 99.85%, a satisfactory result when compared with other state-of-the-art methods.
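The first and last steps of the pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state the median filter window, the adaptive filter design, or the fusion rule, so the window size, the weighted-average fusion, and the names `median_filter`, `fuse_decisions`, and `predict` are all assumptions made for illustration. The Mel spectrogram, Swin Transformer, and feed-forward stages are omitted; the two posterior vectors below are made-up stand-ins for the outputs of the two channels.

```python
from statistics import median
from typing import List, Sequence


def median_filter(signal: Sequence[float], kernel_size: int = 3) -> List[float]:
    """Sliding-window median filter (window size is an assumed parameter).

    Edges are handled by repeating the first and last samples.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not signal:
        return []
    half = kernel_size // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(signal))]


def fuse_decisions(p_a: Sequence[float], p_b: Sequence[float],
                   weight_a: float = 0.5) -> List[float]:
    """Decision-level fusion as a weighted average of two posterior vectors.

    Averaging is an assumed fusion rule; the result is renormalised to sum to 1.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both channels must score the same set of classes")
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]
    total = sum(fused)
    if total == 0:
        raise ValueError("posteriors must not all be zero")
    return [f / total for f in fused]


def predict(fused: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest fused posterior."""
    best = max(range(len(fused)), key=lambda i: fused[i])
    return labels[best]


if __name__ == "__main__":
    # An isolated spike is suppressed by the median filter.
    print(median_filter([1.0, 9.0, 1.0, 1.0, 1.0], kernel_size=3))

    # Hypothetical posteriors from the median-filtered and
    # adaptive-filtered channels over three example commands.
    labels = ["yes", "no", "up"]
    fused = fuse_decisions([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
    print(predict(fused, labels))
```

Fusing at the decision level keeps the two channels fully independent until their class scores are combined, so either enhancement method could be swapped out without retraining the other channel.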
Pages: 1365-1373
Number of pages: 9
Related papers
50 in total
  • [1] Improving speech command recognition through decision-level fusion of deep filtered speech cues
    Sunakshi Mehra
    Virender Ranga
    Ritu Agarwal
    Signal, Image and Video Processing, 2024, 18 : 1365 - 1373
  • [2] Deep fusion framework for speech command recognition using acoustic and linguistic features
    Mehra, Sunakshi
    Susan, Seba
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (25) : 38667 - 38691
  • [3] Deep fusion framework for speech command recognition using acoustic and linguistic features
    Sunakshi Mehra
    Seba Susan
    Multimedia Tools and Applications, 2023, 82 : 38667 - 38691
  • [4] Speech Command Recognition Using Deep Learning
    Ayache, Mohammad
    Kanaan, Hussien
    Kassir, Kawthar
    Kassir, Yasser
    2021 SIXTH INTERNATIONAL CONFERENCE ON ADVANCES IN BIOMEDICAL ENGINEERING (ICABME), 2021, : 24 - 29
  • [5] Gait Recognition System using Decision-Level Fusion
    Lee, Byungyun
    Hong, Sungjun
    Lee, Heesung
    Kim, Euntai
    ICIEA 2010: PROCEEDINGS OF THE 5TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS, VOL 1, 2010, : 336 - 339
  • [6] Classical and Deep Learning Methods for Speech Command Recognition
    Xie, Jie
    Li, Qijing
    Hu, Kai
    Zhu, Mingying
    2021 IEEE 9TH INTERNATIONAL CONFERENCE ON INFORMATION, COMMUNICATION AND NETWORKS (ICICN 2021), 2021, : 41 - 45
  • [7] Decision-level fusion approach to face recognition with multiple cameras
    Yeom, Seokwon
    MOBILE MULTIMEDIA/IMAGE PROCESSING, SECURITY, AND APPLICATIONS 2014, 2014, 9120
  • [8] Decision-Level Fusion of Infrared and Visible images for Face Recognition
    Zhao, Yunfeng
    Yin, Yixin
    Fu, Dongmei
    2008 CHINESE CONTROL AND DECISION CONFERENCE, VOLS 1-11, 2008, : 2411 - 2414
  • [9] Decision Level Fusion for Audio-Visual Speech Recognition in Noisy Conditions
    Sad, Gonzalo D.
    Terissi, Lucas D.
    Gomez, Juan C.
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS, CIARP 2016, 2017, 10125 : 360 - 367
  • [10] Decision-Level Fusion Method for Emotion Recognition using Multimodal Emotion Recognition Information
    Song, Kyu-Seob
    Nho, Young-Hoon
    Seo, Ju-Hwan
    Kwon, Dong-Soo
    2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR), 2018, : 472 - 476