Deep Learning-Based End-to-End Speaker Identification Using Time-Frequency Representation of Speech Signal

Cited by: 5
Authors
Saritha, Banala [1 ]
Laskar, Mohammad Azharuddin [1 ]
Kirupakaran, Anish Monsley [1 ]
Laskar, Rabul Hussain [1 ]
Choudhury, Madhuchhanda [1 ]
Shome, Nirupam [2 ]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar, Assam, India
[2] Assam Univ, Dept Elect & Commun Engn, Silchar, Assam, India
Keywords
Spectrogram; Log Mel spectrogram; Cochleagram; Deep convolutional neural network; Speaker identification; End-to-end system; Features
DOI
10.1007/s00034-023-02542-9
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Subject Classification
0808 ; 0809 ;
Abstract
A speech-based speaker identification system is an alternative to conventional contact-based biometric identification systems. Recent work demonstrates growing research interest in this field and highlights the practical usability of speech for speaker identification across various applications. In this work, we address the limitations of existing state-of-the-art approaches and highlight the suitability of convolutional neural networks for speaker identification. We examine the use of the spectrogram as input to these spatial networks and its robustness in the presence of noise. For faster training (computation) and reduced memory requirements (storage), we introduce the SpectroNet model for speech-based speaker identification. The proposed system is evaluated on the VoxCeleb1 and RSR2015 (Part 1) databases. Experimental results show a relative improvement of ~16% (accuracy 96.21%) with the spectrogram and ~10% (accuracy 98.92%) with the log Mel spectrogram in identifying the speaker, compared with existing models. With the cochleagram, the system achieves an identification accuracy of 99.26%. Analysis of the results shows the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification and (ii) the speech data are noisy.
Pages: 1839-1861
Page count: 23
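The abstract describes feeding time-frequency representations (spectrogram, log Mel spectrogram, cochleagram) of speech into a convolutional network. As an illustrative sketch only (this is not the paper's SpectroNet implementation, and the sampling rate, FFT size, hop length, and number of Mel bands below are assumed values), the log Mel spectrogram front end can be computed with plain NumPy:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy formula: Hz -> mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log Mel spectrogram: windowed STFT power -> Mel filterbank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # (n_frames, n_fft//2 + 1)

    # Triangular Mel filterbank spanning 0 .. sr/2, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):               # rising edge of triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling edge of triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel_power = power @ fbank.T                     # (n_frames, n_mels)
    return np.log(mel_power + 1e-10)                # log compression, avoids log(0)

# Example: 1 s of a synthetic 440 Hz tone at 16 kHz stands in for speech.
t = np.arange(16000) / 16000.0
feat = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feat.shape)  # (97, 40): time frames x Mel bands, a 2-D "image" for a CNN
```

The resulting 2-D array is what a spatial (image-style) CNN consumes: one axis is time, the other frequency on a perceptually motivated Mel scale, with log compression reducing the dynamic range.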