Integrating DNN-HMM Technique with Hierarchical Multi-layer Acoustic Model for Text-Dependent Speaker Verification

Cited by: 9
Authors
Laskar, Mohammad Azharuddin [1]
Laskar, Rabul Hussain [1]
Affiliations
[1] Natl Inst Technol Silchar, Dept Elect & Commun Engn, Silchar 788010, Assam, India
Keywords
Text-dependent speaker verification; DNN; HiLAM; DNN-HMM; Neural networks; Recognition
DOI
10.1007/s00034-019-01103-3
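Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract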
Subspace techniques, such as i-vector/probabilistic linear discriminant analysis and joint factor analysis, have been the most commonly used techniques in the field of text-dependent speaker verification. These techniques, however, do not model the temporal structure of the pass-phrase, which is an important cue in text-dependent speaker verification. The hierarchical multi-layer acoustic model (HiLAM) uses the Gaussian mixture model (GMM)-hidden Markov model (HMM) technique, which also accounts for the temporal information of the pass-phrase. Owing to its modeling of contextual information, HiLAM has been found to outperform the subspace techniques. In this paper, we propose integrating the DNN-HMM technique with HiLAM to further improve system performance. First, an attempt has been made to define a speaker-text unit/class that can characterize the speaker idiosyncrasies, which are known to be associated with shorter and more fundamental units of speech text. To this end, HiLAM is used to propose a new class definition, and the training data is aligned with respect to this class definition. The labeled data is then used to discriminatively train a deep neural network (DNN). The new method of alignment enables the neural network to learn the actual context of the pass-phrase components, which is not the case with a DNN trained in automatic speech recognition fashion. In addition, the network models the speaker idiosyncrasies associated with specific and finer text units. Using DNN posteriors to replace the GMM likelihood probabilities of HiLAM has led to a significant improvement in performance over the baseline HiLAM system. A relative equal error rate (EER) reduction of up to 36.58% has been observed on Part 1 of the RSR2015 database.
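The abstract does not spell out how the DNN posteriors stand in for the GMM likelihoods, nor how the relative EER reduction is computed. The sketch below assumes the common hybrid DNN-HMM convention, in which each posterior is divided by its class prior to give a scaled likelihood, and the usual definition of relative reduction. The function names and the toy numbers are hypothetical illustrations, not taken from the paper.

```python
import numpy as np


def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """Turn per-frame class posteriors p(class | frame) into scaled
    log-likelihoods, log p(frame | class) up to a per-frame constant.

    Assumption: the standard hybrid DNN-HMM conversion (Bayes' rule,
    dropping p(frame)); the paper may use a different variant.
    """
    posteriors = np.asarray(posteriors, dtype=float)  # shape: (frames, classes)
    priors = np.asarray(priors, dtype=float)          # shape: (classes,)
    # eps guards against log(0) for classes with zero posterior or prior.
    return np.log(posteriors + eps) - np.log(priors + eps)


def relative_eer_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Toy example: two frames, two speaker-text classes with equal priors.
toy_posteriors = [[0.5, 0.5], [0.9, 0.1]]
toy_priors = [0.5, 0.5]
toy_scores = scaled_log_likelihoods(toy_posteriors, toy_priors)

# Hypothetical EER values, chosen only to illustrate the arithmetic.
toy_reduction = relative_eer_reduction(2.0, 1.0)
```

In such a setup the scaled log-likelihoods would be used in place of the GMM log-likelihoods as HMM emission scores during decoding.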
Pages: 3548-3572
Number of pages: 25