A robust visual feature extraction based BTSM-LDA for audio-visual speech recognition

Cited by: 0
Authors
Lv, Guoyun [1]
Zhao, Rongchun [1]
Jiang, Dongmei [1]
Li, Yan [1]
Sahli, H. [2]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Vrije Univ Brussel, Dept ETRO, B-1050 Brussels, Belgium
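Keywords
dynamic Bayesian networks; Bayesian tangent shape model; audio-visual; speech recognition
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract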
Asynchrony between speech and lip movement is a key problem in audio-visual speech recognition (AVSR) systems. A Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is proposed for audio-visual speech recognition. Compared with the Multi-Stream HMM (MSHMM), the MS-ADBN model describes the asynchrony between the audio stream and the visual stream up to the word level. In addition, based on the lip profile obtained with the Bayesian Tangent Shape Model (BTSM), Linear Discriminant Analysis (LDA) is used for visual feature extraction, which describes the dynamic features of the lip and removes redundancy in the lip geometrical features. Experimental results on a continuous-digit audio-visual database show that the lip dynamic feature based on BTSM and LDA is more stable and robust than the direct lip geometrical feature. In noisy environments with signal-to-noise ratios ranging from 0 dB to 30 dB, the MS-ADBN model with MFCC and LDA visual features achieves an average improvement of 4.92% in speech recognition rate over the MSHMM.
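The abstract does not give the details of the LDA step, so the following is only a minimal sketch of the general idea: per-frame lip geometry vectors are stacked over a short temporal window to capture dynamics, and a textbook multi-class LDA projection then reduces the stacked vector to a few discriminative dimensions. The feature dimension, window size, class labels, and the function names `stack_frames` and `fit_lda` are all assumptions made for illustration; the synthetic numbers stand in for BTSM lip contours, which are not reproduced here.

```python
import numpy as np

def stack_frames(frames, context=2):
    """Concatenate each frame with its +/- `context` neighbours.

    Edge frames are padded by repetition. This is one common way to
    expose temporal dynamics to LDA (an assumption, not the paper's recipe).
    """
    frames = np.asarray(frames, dtype=float)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = len(frames)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def fit_lda(X, y, n_components):
    """Textbook multi-class LDA: top eigenvectors of pinv(Sw) @ Sb."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real

# Synthetic stand-in data: 3 hypothetical classes, 60 frames each,
# 4 geometric measurements per frame (e.g. lip width, height, ...).
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(loc=m, scale=0.5, size=(60, 4))
                    for m in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 60)

X = stack_frames(frames, context=2)    # 4 features x 5 frames = 20 dims
W = fit_lda(X, labels, n_components=2)
Z = X @ W                              # reduced visual features
```

With C classes, LDA yields at most C - 1 useful directions, which is why the sketch keeps two components for three classes; the redundancy in the 20-dimensional stacked vector is discarded by the projection.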
Pages: 1044 / +
Page count: 2