A High-Performance Neural Network SoC for End-to-End Speaker Verification

Cited by: 0

Authors:

Tsai, Tsung-Han [1]

Chiang, Meng-Jui [1]

Affiliations:

[1] Natl Cent Univ, Dept Elect Engn, Taoyuan 32001, Taiwan

Source:

IEEE Access | 2024 / Vol. 12

Keywords:

Speaker verification (SV); speaker identification; x-vector; RISC-V; system-on-chip (SoC); GMM

DOI:

10.1109/ACCESS.2024.3491780

Chinese Library Classification:

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

Neural networks have become a popular way to recognize a speaker's identity from speech in the last few years. Among these methods, the x-vector extractor, which is based on time-delay neural networks (TDNN), handles noise better and generally achieves higher accuracy than earlier methods such as the Gaussian mixture model (GMM) and the support vector machine (SVM). This paper presents a system-on-chip (SoC) composed of a RISC-V CPU and a neural network accelerator module for x-vector-based speaker verification (SV). To meet real-time latency requirements and allow deployment on edge devices, the x-vector model is processed in three steps: size reduction, pruning, and compression. The work focuses on optimizing the data flow to exploit sparsity. In place of the conventional compressed sparse row (CSR) format for sparse matrices, we propose the binary pointer compressed sparse row (BPCSR) method, which significantly reduces latency and avoids load imbalance across the processing elements (PEs). We further design a neural network accelerator module that stores the compressed parameters and computes the x-vector extractor, while the RISC-V CPU handles the remaining calculations, such as feature extraction and the classifier. The system was tested on the VoxCeleb dataset, containing 1251 test speakers, and achieved over 95% accuracy. Finally, we synthesized the chip in TSMC 90 nm technology; it occupies 15.5 mm² and consumes 97.88 mW for real-time identification.
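For readers unfamiliar with the CSR baseline named in the abstract, the following is a minimal Python sketch of conventional CSR storage and a sparse matrix-vector product. It does not show BPCSR, whose details are not given in this record. The function names `to_csr` and `csr_matvec` and the small example matrix are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def to_csr(dense: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Encode a dense matrix in conventional CSR form (hypothetical helper)."""
    values: List[float] = []   # non-zero weights, row by row
    col_idx: List[int] = []    # column index of each stored weight
    row_ptr: List[int] = [0]   # row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values: List[float], col_idx: List[int],
               row_ptr: List[int], x: List[float]) -> List[float]:
    """Compute y = W @ x, touching only the stored non-zero weights."""
    y: List[float] = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Illustrative pruned weight matrix (assumed values, not from the paper).
W = [[0.0, 2.0, 0.0],
     [1.0, 0.0, 3.0],
     [0.0, 0.0, 0.0]]
values, col_idx, row_ptr = to_csr(W)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
```

In this example the rows hold one, two, and zero non-zero weights. If each row were assigned to a separate processing element, the work per element would be uneven, which is the kind of load-balancing concern the abstract attributes to CSR.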

Pages: 165482-165496

Page count: 15
