A Chinese acoustic model based on convolutional neural network

Cited by: 3
Authors
Zhang, Qiang [1, 2]
Sang, Jun [1, 2]
Alam, Mohammad S. [3]
Cai, Bin [1, 2]
Yang, Li [1, 2]
Affiliations
[1] Chongqing Univ, Minist Educ, Key Lab Dependable Serv Comp Cyber Phys Soc, Chongqing 400044, Peoples R China
[2] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China
[3] Texas A&M Univ Kingsville, Frank H Dotterweich Coll Engn, Kingsville, TX 78363 USA
Keywords
Speech recognition; acoustic model; Chinese; convolutional neural network; connectionist temporal classification (CTC)
DOI
10.1117/12.2520356
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
Speech recognition has long been a research focus in human-computer communication and interaction. The main purpose of automatic speech recognition (ASR) is to convert speech waveform signals into text. The acoustic model is the main component of ASR; it connects the observed features of speech signals with the speech modeling units. In recent years, deep learning has become the mainstream technology in speech recognition. In this paper, a convolutional neural network architecture combining VGG with the Connectionist Temporal Classification (CTC) loss function is proposed as a speech recognition acoustic model. Traditional acoustic model training is based on frame-level labels with a cross-entropy criterion, which requires a tedious label alignment procedure. The CTC loss was adopted to automatically learn the alignments between speech frames and label sequences, so that training is end-to-end. The architecture can exploit the temporal and spectral structures of speech signals simultaneously. Batch normalization (BN) was used to normalize each layer's input and reduce internal covariate shift. To prevent overfitting, dropout was used during training to improve the network's generalization ability. The speech signal was transformed through a series of processing steps into a spectral image that serves as the input of the neural network. The input feature has 200 dimensions, and the output labels of the acoustic model are 415 Chinese pronunciations without tone. The experimental results demonstrate that the proposed model achieves character error rates (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
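The abstract names two steps that a short sketch can make concrete: turning per-frame CTC outputs into a label sequence, and scoring a hypothesis by error rate. The Python below is a minimal illustration, not the authors' implementation. It assumes greedy (best-path) decoding and the standard Levenshtein-based error rate, neither of which the abstract specifies. The function names and the choice of index 0 for the CTC blank are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank; the record does not state this


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse a best-path frame labelling: merge repeated labels, then drop blanks."""
    collapsed: List[int] = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            row.append(min(
                previous_row[j] + 1,                              # deletion
                row[j - 1] + 1,                                   # insertion
                previous_row[j - 1] + (ref_item != hyp_item),     # substitution or match
            ))
        previous_row = row
    return previous_row[-1]


def error_rate(reference: Sequence, hypothesis: Sequence) -> float:
    """Edit distance divided by reference length (CER when the items are characters)."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A blank between two identical labels keeps them distinct, which is how CTC represents a repeated unit: `ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0])` yields `[3, 3, 5]`.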
Pages: 7