Improving speech command recognition through decision-level fusion of deep filtered speech cues

Cited by: 5
Authors
Mehra, Sunakshi [1]
Ranga, Virender [1]
Agarwal, Ritu [1]
Affiliations
[1] Delhi Technological University, Department of Information Technology, Delhi, India
Keywords
Speech filtering techniques; Swin-tiny transformer; Feed-forward neural network (FNN); Speech command recognition; ENHANCEMENT;
DOI
10.1007/s11760-023-02845-z
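Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Subject classification code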
0808 ; 0809 ;
Abstract
Living beings communicate through speech, which can be analysed to identify words and sentences by recognizing the flow of spoken utterances. However, background noise always has an impact on the speech recognition process, and the detection rate in its presence is still unsatisfactory, which calls for further research and potential remedies. To improve noisy speech information, this research proposes speech recognition based on a combination of median filtering and adaptive filtering. Speech command recognition is achieved by applying these popular noise reduction techniques and using the two filtered speech signals as parallel, independent channels. The procedure involves five steps. First, signals are enhanced using two parallel, independent speech enhancement models (median and adaptive filtering). Second, 2D Mel spectrogram images are extracted from the enhanced signals. Third, the 2D Mel spectrogram images are passed to the tiny Swin Transformer for classification; this classification is performed over the large-scale ImageNet dataset, which consists of 14 million images and is approximately 150 GB in size. Fourth, the posterior probabilities extracted from the tiny Swin Transformer are fed into our proposed 3-layered feed-forward network for classification among our 10 speech command categories. Lastly, decision-level fusion is applied to the two parallel, independent channels obtained from the 3-layered feed-forward network. For experimentation, the Google Speech Command dataset version 2 is used. We obtained a test accuracy of 99.85%, a satisfactory result when compared with other state-of-the-art methods.
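As a rough illustration of the first and last steps described in the abstract, the following minimal Python sketch shows a sliding-window median filter and one possible decision-level fusion rule. The abstract does not state the kernel size, the edge handling, or the fusion rule, so the odd kernel with edge padding, the weighted averaging of posteriors, and the function names `median_filter` and `fuse_posteriors` are all assumptions made for illustration and are not the authors' implementation.

```python
from statistics import median


def median_filter(signal, kernel_size=3):
    """Sliding-window median filter (assumed variant).

    Edges are padded by repeating the first and last samples so the
    output has the same length as the input.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    samples = list(signal)
    if not samples:
        return []
    half = kernel_size // 2
    padded = [samples[0]] * half + samples + [samples[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(samples))]


def fuse_posteriors(p_median, p_adaptive, weight=0.5):
    """Decision-level fusion by weighted averaging (assumed rule).

    p_median and p_adaptive are the class posteriors produced by the two
    independent channels. Returns the fused posteriors and the index of
    the most probable class.
    """
    if len(p_median) != len(p_adaptive):
        raise ValueError("both channels must score the same classes")
    fused = [weight * a + (1.0 - weight) * b
             for a, b in zip(p_median, p_adaptive)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted
```

With equal weights, two channels that disagree are resolved in favour of the class with the higher combined probability; other rules such as the product of posteriors or majority voting would fit the same interface.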
Pages: 1365 - 1373
Page count: 9
Related papers
50 in total
  • [21] Computer-Aided Recognition Based on Decision-Level Multimodal Fusion for Depression
    Zhang, Bingtao
    Cai, Hanshu
    Song, Yubo
    Tao, Lei
    Li, Yanlin
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (07) : 3466 - 3477
  • [22] Automated recognition of construction worker activities using multimodal decision-level fusion
    Gong, Yue
    Seo, Joonoh
    Kang, Kyung-Su
    Shi, Mengnan
    AUTOMATION IN CONSTRUCTION, 2025, 172
  • [23] Object Recognition Based on the Context Aware Decision-Level Fusion in Multiviews Imagery
    Mahmoudi, Fatemeh Tabib
    Samadzadegan, Farhad
    Reinartz, Peter, Jr.
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2015, 8 (01) : 12 - 22
  • [24] Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild
    Bo Sun
    Liandong Li
    Xuewen Wu
    Tian Zuo
    Ying Chen
    Guoyan Zhou
    Jun He
    Xiaoming Zhu
    Journal on Multimodal User Interfaces, 2016, 10 : 125 - 137
  • [25] Robust Face Recognition Using the Deep C2D-CNN Model Based on Decision-Level Fusion
    Li, Jing
    Qiu, Tao
    Wen, Chang
    Xie, Kai
    Wen, Fang-Qing
    SENSORS, 2018, 18 (07)
  • [26] Decision-Level Fusion Tracking for Infrared and Visible Spectra Based on Deep Learning
    Tang Cong
    Ling Yongshun
    Yang Hua
    Yang Xing
    Tong Wuqin
    LASER & OPTOELECTRONICS PROGRESS, 2019, 56 (07)
  • [27] Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild
    Sun, Bo
    Li, Liandong
    Wu, Xuewen
    Zuo, Tian
    Chen, Ying
    Zhou, Guoyan
    He, Jun
    Zhu, Xiaoming
    JOURNAL ON MULTIMODAL USER INTERFACES, 2016, 10 (02) : 125 - 137
  • [28] Improving the Utility of Speech Recognition Through Error Detection
    Kimberly Voll
    Stella Atkins
    Bruce Forster
    Journal of Digital Imaging, 2008, 21
  • [29] Improving speech recognition learning through lazy training
    Rimer, ME
    Martinez, TR
    Wilson, DR
    PROCEEDING OF THE 2002 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-3, 2002, : 2568 - 2573
  • [30] Improving Speech Recognition Rate through Analysis Parameters
    Eringis, Deividas
    Tamulevicius, Gintautas
    ELECTRICAL CONTROL AND COMMUNICATION ENGINEERING, 2014, 5 (01) : 61 - 66