Deep Learning-Based End-to-End Speaker Identification Using Time-Frequency Representation of Speech Signal

Cited by: 5
Authors
Saritha, Banala [1 ]
Laskar, Mohammad Azharuddin [1 ]
Kirupakaran, Anish Monsley [1 ]
Laskar, Rabul Hussain [1 ]
Choudhury, Madhuchhanda [1 ]
Shome, Nirupam [2 ]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar, Assam, India
[2] Assam Univ, Dept Elect & Commun Engn, Silchar, Assam, India
Keywords
Spectrogram; Log Mel spectrogram; Cochleagram; Deep convolutional neural network; Speaker identification; End-to-end system; Features
DOI
10.1007/s00034-023-02542-9
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Subject Classification
0808 ; 0809 ;
Abstract
A speech-based speaker identification system is an alternative to conventional contact-based biometric identification systems. Recent work demonstrates growing research interest in this field and highlights the practical usability of speech for speaker identification across various applications. In this work, we address the limitations of existing state-of-the-art approaches and highlight the suitability of convolutional neural networks for speaker identification. We examine the use of the spectrogram as input to these spatial networks and its robustness in the presence of noise. For faster training (computation) and reduced memory requirements (storage), we introduce the SpectroNet model for speech-based speaker identification. The proposed system is evaluated on the VoxCeleb1 and RSR2015 (Part 1) databases. Experimental results show a relative improvement of ~16% (accuracy 96.21%) with the spectrogram and ~10% (accuracy 98.92%) with the log Mel spectrogram in identifying the speaker, compared with existing models. With the cochleagram, the system achieves an identification accuracy of 99.26%. Analysis of the results shows the applicability of the proposed approach in situations where (i) minimal speech data are available for speaker identification and (ii) the speech data are noisy.
Pages: 1839-1861
Page count: 23
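The abstract describes feeding time-frequency representations (spectrogram, log Mel spectrogram, cochleagram) of speech into a convolutional network. As an illustrative sketch only (this is not the paper's SpectroNet implementation, and the sampling rate, FFT size, hop length, and number of Mel bands below are assumed values), the log Mel spectrogram front end can be computed with plain NumPy:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy formula: Hz -> mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log Mel spectrogram: windowed STFT power -> Mel filterbank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2  # (n_frames, n_fft//2 + 1)

    # Triangular Mel filterbank spanning 0 .. sr/2, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):               # rising edge of triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling edge of triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel_power = power @ fbank.T                     # (n_frames, n_mels)
    return np.log(mel_power + 1e-10)                # log compression, avoids log(0)

# Example: 1 s of a synthetic 440 Hz tone at 16 kHz stands in for speech.
t = np.arange(16000) / 16000.0
feat = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(feat.shape)  # (97, 40): time frames x Mel bands, a 2-D "image" for a CNN
```

The resulting 2-D array is what a spatial (image-style) CNN consumes: one axis is time, the other frequency on a perceptually motivated Mel scale, with log compression reducing the dynamic range.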