Robust Self-Supervised Audio-Visual Speech Recognition

Cited by: 14
Authors
Shi, Bowen [1 ]
Hsu, Wei-Ning [2 ]
Mohamed, Abdelrahman [2 ]
Affiliations
[1] Toyota Technol Inst Chicago, Chicago, IL 61801 USA
[2] Meta AI, New York, NY USA
Source
INTERSPEECH 2022
Keywords
audio-visual speech recognition; self-supervised learning; representation learning; robust speech recognition;
DOI
10.21437/Interspeech.2022-99
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup, so progress was limited by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by ~50% (28.0% vs. 14.1% WER) using less than 10% of the labeled data (433 hr vs. 30 hr) in the presence of babble noise, while reducing the WER of an audio-only model by over 75% (25.8% vs. 5.8%) on average.
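The abstract evaluates WER under additive babble noise. As a minimal sketch (not taken from the paper), the snippet below shows the standard way such a noisy test condition is constructed: a babble noise waveform is scaled and added to the clean speech at a target signal-to-noise ratio before the audio is fed to the recognizer. The function name and exact mixing procedure are illustrative assumptions, not the authors' code.

    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Additively mix `noise` into `clean` so the mixture has the requested SNR in dB."""
        # Loop or trim the noise so it matches the length of the clean utterance.
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[: len(clean)]

        clean_power = np.mean(clean ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12

        # Scale the noise so that clean_power / noise_power equals the target SNR.
        target_noise_power = clean_power / (10 ** (snr_db / 10))
        noise = noise * np.sqrt(target_noise_power / noise_power)
        return clean + noise

    # Example usage (illustrative): noisy = mix_at_snr(clean_wav, babble_wav, snr_db=0.0)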
Pages: 2118-2122
Page count: 5
Related Papers
50 records in total
  • [41] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [42] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [43] Audio-Visual Speech Modeling for Continuous Speech Recognition
    Dupont, Stephane
    Luettin, Juergen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 141 - 151
  • [44] Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
    Cai, Danwei
    Wang, Weiqing
    Li, Ming
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1422 - 1435
  • [45] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
    Anwar, Mohamed
    Shi, Bowen
    Goswami, Vedanuj
    Hsu, Wei-Ning
    Pino, Juan
    Wang, Changhan
    INTERSPEECH 2023, 2023, : 4064 - 4068
  • [46] A ROBUST AUDIO-VISUAL SPEECH ENHANCEMENT MODEL
    Wang, Wupeng
    Xing, Chao
    Wang, Dong
    Chen, Xiao
    Sun, Fengyu
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7529 - 7533
  • [47] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [48] A coupled HMM for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Xiaoxiang, L
    Mao, C
    Murphy, K
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2013 - 2016
  • [49] An asynchronous DBN for audio-visual speech recognition
    Saenko, Kate
    Livescu, Karen
    2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
  • [50] Audio-visual modeling for bimodal speech recognition
    Kaynak, MN
    Zhi, Q
    Cheok, AD
    Sengupta, K
    Chung, KC
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186