Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition

Cited: 4
Authors
Hwang, Jung-Wook [1 ]
Park, Jeongkyun [2 ]
Park, Rae-Hong [1 ,3 ]
Park, Hyung-Min [1 ]
Affiliations
[1] Sogang Univ, Dept Elect Engn, Seoul 04107, South Korea
[2] Sogang Univ, Dept Artificial Intelligence, Seoul 04107, South Korea
[3] Sogang Univ, ICT Convergence Disaster Safety Res Inst, Seoul 04107, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Audio-visual speech recognition; Audio-visual speech enhancement; Deep learning; Joint training; Conformer; Robust speech recognition; DEREVERBERATION; NOISE;
DOI
10.1016/j.apacoust.2023.109478
CLC Classification Number
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Visual features are attractive cues for robust automatic speech recognition (ASR). In particular, in acoustically unfavorable environments, recognition performance can be improved by combining audio with visual information obtained from the speaker's face rather than relying on audio alone. For this reason, various audio-visual speech recognition (AVSR) models have recently been studied. However, experimental results for these models show that the information most important for recognition is concentrated in the audio signal, while visual information mainly improves robustness when the audio is corrupted in noisy environments. Consequently, the recognition performance of conventional AVSR models in noisy environments is limited. Unlike conventional AVSR models that use the input audio-visual information directly, in this paper we propose an AVSR model that first performs audio-visual speech enhancement (AVSE) to enhance the target speech based on audio-visual information and then uses both the audio enhanced by the AVSE and visual information such as the speaker's lips or face. In particular, we propose a deep AVSR model trained end-to-end as a single model by integrating a conformer-based AVSR model with hybrid decoding and a U-net-based AVSE model with recurrent neural network (RNN) attention (RA). Experimental results on the LRS2-BBC and LRS3-TED datasets demonstrate that the AVSE model effectively suppresses corrupting noise and the AVSR model achieves noise robustness. In particular, the proposed jointly trained model, which integrates the AVSE and AVSR stages into one model, showed better recognition performance than the compared methods. © 2023 Elsevier Ltd. All rights reserved.
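Joint end-to-end training of an enhancement front-end and a recognition back-end, as described above, is typically realized by optimizing a weighted combination of the two objectives in a single backward pass. This record does not give the paper's actual loss formulation, so the following is a minimal sketch under that common assumption; the function name `joint_loss`, the interpolation weight `alpha`, and the example loss values are all illustrative, not taken from the paper.

```python
def joint_loss(avse_loss: float, avsr_loss: float, alpha: float = 0.5) -> float:
    """Hypothetical combined objective for joint AVSE+AVSR training.

    avse_loss : enhancement loss (e.g., a spectral reconstruction error)
    avsr_loss : recognition loss (e.g., a hybrid CTC/attention loss)
    alpha     : illustrative interpolation weight in [0, 1]; the paper's
                actual weighting scheme is not specified in this record.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * avse_loss + (1.0 - alpha) * avsr_loss

# Toy example: weight the recognition term more heavily than enhancement.
print(joint_loss(0.8, 1.2, alpha=0.25))  # 0.25*0.8 + 0.75*1.2 = 1.1
```

Because both stages contribute to one scalar objective, gradients from the recognizer flow back through the enhancer, letting the AVSE stage learn enhancement that specifically benefits recognition rather than only signal quality.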
Pages: 8