CONVOLUTION-BASED ATTENTION MODEL WITH POSITIONAL ENCODING FOR STREAMING SPEECH RECOGNITION ON EMBEDDED DEVICES

Cited by: 4

Authors:

Park, Jinhwan [1]

Kim, Chanwoo [2]

Sung, Wonyong [1]

Affiliations:

[1] Seoul Natl Univ, Seoul, South Korea

[2] Samsung Res, Seoul, South Korea

Source:

2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021

Funding:

National Research Foundation of Singapore

Keywords:

DOI:

10.1109/SLT48900.2021.9383583

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

On-device automatic speech recognition (ASR) is strongly preferred over server-based implementations owing to its low latency and privacy protection. Many server-based ASR systems employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with a limited number of states; however, RNNs are inefficient for single-stream implementations on embedded devices. In this study, a highly efficient convolutional model-based ASR system with monotonic chunkwise attention is developed. Although temporal convolution-based models allow more efficient implementations, they demand a long filter length to avoid looping or skipping problems. To remedy this problem, we add positional encoding to a convolution-based ASR encoder while shortening the filter length. It is demonstrated that the accuracy of the short filter-length convolutional model is significantly improved. In addition, the effect of positional encoding is analyzed by visualizing the attention energy and encoder outputs. The proposed model achieves a word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.
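
The idea of adding positional encoding ahead of a short-filter temporal convolution can be illustrated with a small sketch. This is not the authors' model: the abstract does not say which positional encoding is used, so the sinusoidal form below is an assumption, and the function names (`sinusoidal_positional_encoding`, `add_positional_encoding`, `causal_temporal_conv`), the toy input, and the kernel weights are all hypothetical. The sketch only shows that a short causal filter cannot tell apart frames with identical local context, whereas the same filter applied to position-encoded frames can.

```python
import math


def sinusoidal_positional_encoding(num_frames, dim):
    """Return a num_frames x dim table of sinusoidal position codes.

    Assumption: the sinusoidal form is chosen for illustration only;
    the abstract does not specify the encoding type.
    """
    table = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            # Channel pairs (2k, 2k+1) share one frequency: sin on even, cos on odd.
            angle = t / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positional_encoding(features):
    """Add the position code of frame t to the feature vector of frame t."""
    num_frames, dim = len(features), len(features[0])
    codes = sinusoidal_positional_encoding(num_frames, dim)
    return [[x + p for x, p in zip(frame, code)]
            for frame, code in zip(features, codes)]


def causal_temporal_conv(features, kernel):
    """Causal 1-D convolution over time with a short filter.

    kernel[k] weights the frame k steps in the past; the same weights are
    applied to every channel (a hypothetical simplification).
    """
    num_frames, dim = len(features), len(features[0])
    out = []
    for t in range(num_frames):
        frame = [0.0] * dim
        for k, w in enumerate(kernel):
            if t - k >= 0:  # ignore taps that would reach before the first frame
                for c in range(dim):
                    frame[c] += w * features[t - k][c]
        out.append(frame)
    return out


# Toy input: six identical frames, as in a sustained or repeated sound.
frames = [[1.0, 0.5, -0.5, 0.25] for _ in range(6)]
kernel = [0.5, 0.3, 0.2]  # filter length 3 (hypothetical weights)

plain = causal_temporal_conv(frames, kernel)
encoded = causal_temporal_conv(add_positional_encoding(frames), kernel)

print(plain[3] == plain[4])      # the short filter sees the same context twice
print(encoded[3] == encoded[4])  # position codes make the two outputs differ
```

In this toy setting the plain outputs at frames 3 and 4 coincide, which is one intuition for why attention over such encoder outputs might loop or skip, while the position-encoded outputs stay distinct without lengthening the filter.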


Pages: 30-37

Page count: 8
