Improving speech command recognition through decision-level fusion of deep filtered speech cues

Cited by: 5

Authors:

Mehra, Sunakshi [1]

Ranga, Virender [1]

Agarwal, Ritu [1]

Affiliations:

[1] Delhi Technol Univ, Dept Informat Technol, Delhi, India

Source:

SIGNAL IMAGE AND VIDEO PROCESSING | 2024 / Vol. 18 / Issue 02

Keywords:

Speech filtering techniques; Swin-tiny transformer; Feed-forward neural network (FNN); Speech command recognition; Enhancement

DOI:

10.1007/s11760-023-02845-z

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Living beings communicate through speech, which can be analysed to identify words and sentences by recognizing the flow of spoken utterances. Background noise, however, always affects the speech recognition process, and the detection rate in noisy conditions remains unsatisfactory, which motivates further research into remedies. To improve noisy speech, this research proposes speech recognition based on a combination of median filtering and adaptive filtering. Speech command recognition is achieved by applying these popular noise reduction techniques in two parallel, independent channels of filtered speech. The procedure involves five steps. First, signals are enhanced using two parallel, independent speech enhancement models (median and adaptive filtering). Second, 2D Mel spectrogram images are extracted from the enhanced signals. Third, the 2D Mel spectrogram images are passed to the tiny Swin Transformer, which classifies them among the categories of the large-scale ImageNet dataset (about 14 million images, approximately 150 GB in size). Fourth, the posterior probabilities produced by the tiny Swin Transformer are fed into our proposed 3-layer feed-forward network for classification among our 10 speech command categories. Fifth, decision-level fusion is applied to the outputs that the two parallel, independent channels obtain from the 3-layer feed-forward network. For experimentation, the Google Speech Commands dataset version 2 is used. We obtained a test accuracy of 99.85%, a satisfactory result compared with other state-of-the-art methods.
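
The first and last steps of the procedure can be illustrated with a short sketch. The abstract does not state the filter window or the fusion rule, so everything below is an assumption made for illustration: a sliding-window median filter with edge replication stands in for the median-filtering channel, and decision-level fusion is shown as a weighted average of the two channels' posterior vectors. The function names (median_filter, fuse_posteriors, predict), the labels, and the probability values are hypothetical and are not taken from the paper.

```python
from statistics import median


def median_filter(signal, kernel_size=3):
    """Sliding-window median filter with edge replication (assumed variant)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    if not signal:
        return []
    half = kernel_size // 2
    # Repeat the first and last samples so every window has kernel_size values.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(signal))]


def fuse_posteriors(p_a, p_b, weight_a=0.5):
    """Weighted average of two posterior vectors, renormalised to sum to 1.

    Averaging is only one possible decision-level fusion rule; it is an
    assumption here, not the rule reported by the authors.
    """
    fused = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]
    total = sum(fused)
    return [f / total for f in fused]


def predict(p_a, p_b, labels):
    """Return the label with the highest fused posterior."""
    fused = fuse_posteriors(p_a, p_b)
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Hypothetical posteriors from the two channels for three command classes.
labels = ["yes", "no", "up"]
p_median_channel = [0.6, 0.3, 0.1]
p_adaptive_channel = [0.2, 0.7, 0.1]

print(median_filter([1, 9, 1, 1, 1]))  # the isolated spike at index 1 is removed
print(predict(p_median_channel, p_adaptive_channel, labels))
```

In this toy case the two channels disagree on their own, and the fused decision follows the channel that is more confident overall. A different rule (product of posteriors, majority vote, learned weights) would slot into fuse_posteriors without changing the rest of the sketch.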

Citation

Pages: 1365-1373

Page count: 9
