Integrating DNN-HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification

Cited by: 9

Authors:

Laskar, Mohammad Azharuddin [1]

Laskar, Rabul Hussain [1]

Affiliations:

[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar 788010, Assam, India

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2019 / Vol. 38 / No. 08

Keywords:

Text-dependent speaker verification; DNN; HiLAM; DNN-HMM; NEURAL-NETWORKS; RECOGNITION;

DOI:

10.1007/s00034-019-01103-3

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Subspace techniques, such as i-vector/probabilistic linear discriminant analysis and joint factor analysis, have been the most commonly used techniques in the field of text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase which otherwise is an important cue in the context of text-dependent speaker verification. The hierarchical multi-layer acoustic model (HiLAM) uses Gaussian mixture model (GMM)-hidden Markov model (HMM) technique, which also accounts for the temporal information of the pass-phrase. Owing to its contextual information modeling, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating DNN-HMM technique with HiLAM to further improve the system performance. Firstly, an attempt has been made to define a speaker-text unit/class that could characterize the speaker idiosyncrasies, which are known to be associated with shorter and more fundamental units of speech text. To this end, HiLAM is used to propose a new class definition, and the training data is aligned with respect to this class definition. The labeled data is then used to discriminatively train a deep neural network (DNN). The new method of alignment enables the neural network to learn the actual context of the pass-phrase components. This is not the case with DNN trained in automatic speech recognition fashion. Besides, the network also models the speaker idiosyncrasies associated with specific and finer text units. The use of DNN posteriors to replace the GMM likelihood probabilities of HiLAM has led to significant improvement in performance over the baseline HiLAM system. Relative EER reduction of up to 36.58% has been observed on Part 1 of RSR2015 database.
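The abstract's central step, using DNN posteriors in place of GMM likelihoods inside an HMM, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the common hybrid DNN-HMM convention of dividing each posterior by its class prior to obtain a scaled likelihood, which the abstract does not confirm. All function names, the two-state left-to-right topology, and the numbers are hypothetical. A helper for relative EER reduction is included to show how such a percentage is typically computed.

```python
import math

def safe_log(p):
    """Natural log that maps 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def scaled_log_likelihoods(posteriors, priors):
    """Turn per-frame posteriors P(state | frame) into scaled
    log-likelihoods log P(state | frame) - log P(state).
    Assumption: standard hybrid DNN-HMM scaling by class priors."""
    return [[safe_log(p) - safe_log(q) for p, q in zip(frame, priors)]
            for frame in posteriors]

def forward_log_score(log_obs, trans, init):
    """Forward algorithm in the log domain; returns the total log score
    of the frame sequence under the HMM."""
    n = len(init)
    alpha = [safe_log(init[s]) + log_obs[0][s] for s in range(n)]
    for frame in log_obs[1:]:
        alpha = [
            logsumexp([alpha[r] + safe_log(trans[r][s]) for r in range(n)])
            + frame[s]
            for s in range(n)
        ]
    return logsumexp(alpha)

def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Hypothetical toy data: 3 frames, 2 states, left-to-right topology.
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]  # stand-in for DNN outputs
priors = [0.5, 0.5]                                # stand-in class priors
trans = [[0.7, 0.3], [0.0, 1.0]]                   # no backward transitions
init = [1.0, 0.0]                                  # always start in state 0

log_obs = scaled_log_likelihoods(posteriors, priors)
score = forward_log_score(log_obs, trans, init)
print(score)
```

In a verification setting, a score of this kind for the claimed speaker's model would typically be compared against a background or impostor score. How HiLAM does this is not described in the abstract.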


Pages: 3548 - 3572

Page count: 25

Total: 21 items

[1] Integrating DNN–HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification
Mohammad Azharuddin Laskar
Rabul Hussain Laskar
Circuits, Systems, and Signal Processing, 2019, 38 : 3548 - 3572
[2] Unsupervised Learning of HMM Topology for Text-dependent Speaker Verification
Liu, Ming
Huang, Thomas
INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 921 - 924
[3] DNN BASED SPEAKER EMBEDDING USING CONTENT INFORMATION FOR TEXT-DEPENDENT SPEAKER VERIFICATION
Dey, Subhadeep
Koshinaka, Takafumi
Motlicek, Petr
Madikeri, Srikanth
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5344 - 5348
[4] An alternative normalization scheme in HMM-based text-dependent speaker verification
Charlet, D
Jouvet, D
Collin, O
SPEECH COMMUNICATION, 2000, 31 (2-3) : 113 - 120
[5] Multi-Task Learning for Text-dependent Speaker Verification
Chen, Nanxin
Qian, Yanmin
Yu, Kai
16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 185 - 189
[6] Joint Training of Expanded End-to-end DNN for Text-dependent Speaker Verification
Heo, Hee-soo
Jung, Jee-weon
Yang, Il-ho
Yoon, Sung-hyun
Yu, Ha-jin
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1532 - 1536
[7] i-vector/HMM Based Text-dependent Speaker Verification System for RedDots Challenge
Zeinali, Hossein
Sameti, Hossein
Burget, Lukas
Cernocky, Jan
Maghsoodi, Nooshin
Matejka, Pavel
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 440 - 444
[8] Spectral subtraction and RASTA-filtering in text-dependent HMM-based speaker verification
Hardt, D
Fellbaum, K
1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 867 - 870
[9] Model selection and score normalization for text-dependent single utterance speaker verification
Buyuk, Osman
Arslan, Mustafa Levent
TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2012, 20 : 1277 - 1295
[10] Integrating Online i-vector into GMM-UBM for Text-dependent Speaker Verification
Jiang, Xiaowei
Wang, Shuai
Xiang, Xu
Qian, Yanmin
2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC 2017), 2017, : 1628 - 1632
