CONVOLUTION-BASED ATTENTION MODEL WITH POSITIONAL ENCODING FOR STREAMING SPEECH RECOGNITION ON EMBEDDED DEVICES

Cited by: 4
Authors
Park, Jinhwan [1 ]
Kim, Chanwoo [2 ]
Sung, Wonyong [1 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] Samsung Res, Seoul, South Korea
Source
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT) | 2021
Funding
National Research Foundation, Singapore;
DOI
10.1109/SLT48900.2021.9383583
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
On-device automatic speech recognition (ASR) is often preferred to server-based implementations owing to its low latency and privacy protection. Many server-based ASR systems employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with a limited number of states; however, RNNs are inefficient for single-stream implementations on embedded devices. In this study, a highly efficient convolutional model-based ASR with monotonic chunkwise attention is developed. Although temporal convolution-based models allow more efficient implementations, they demand a long filter length to avoid looping or skipping problems. To remedy this problem, we add positional encoding to a convolution-based ASR encoder while shortening the filter length. It is demonstrated that the accuracy of the short filter-length convolutional model is significantly improved. In addition, the effect of positional encoding is analyzed by visualizing the attention energy and encoder outputs. The proposed model achieves a word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.
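As a rough illustration of what adding positional encoding to encoder inputs can look like, the sketch below builds a sinusoidal position table and adds it frame by frame to a feature sequence. The abstract does not state which form of positional encoding the authors use, so the sinusoidal formulation here is an assumption, and the function names and toy input are hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_frames, dim):
    """Return a num_frames x dim table of sinusoidal position codes.

    Assumption: the standard sinusoidal form, which may differ from
    the encoding used in the paper.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(0, dim, 2):
            # Each sin/cos pair shares one wavelength.
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


def add_positional_encoding(features):
    """Add the position code to each frame of a frames x dim feature list."""
    num_frames = len(features)
    dim = len(features[0])
    codes = sinusoidal_positional_encoding(num_frames, dim)
    return [
        [x + p for x, p in zip(frame, code)]
        for frame, code in zip(features, codes)
    ]


# Hypothetical input: 4 frames of 6-dimensional all-zero features.
frames = [[0.0] * 6 for _ in range(4)]
encoded = add_positional_encoding(frames)
```

Because every frame receives a distinct code, frames with identical acoustic content still differ by position after the addition, which is the kind of cue a short-filter convolutional encoder would otherwise lack.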
Pages: 30-37
Number of pages: 8