Robust Self-Supervised Audio-Visual Speech Recognition

被引:14
|
作者
Shi, Bowen [1 ]
Hsu, Wei-Ning [2 ]
Mohamed, Abdelrahman [2 ]
机构
[1] Toyota Technol Inst Chicago, Chicago, IL 61801 USA
[2] Meta AI, New York, NY USA
来源
关键词
audio-visual speech recognition; self-supervised learning; representation learning; robust speech recognition;
D O I
10.21437/Interspeech.2022-99
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by similar to 50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average (1).
引用
收藏
页码:2118 / 2122
页数:5
相关论文
共 50 条
  • [21] AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS
    Chern, I-Chun
    Hung, Kuo-Hsuan
    Chen, Yi-Ting
    Hussain, Tassadaq
    Gogate, Mandar
    Hussain, Amir
    Tsao, Yu
    Hou, Jen-Cheng
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [22] Robust audio-visual speech recognition based on late integration
    Lee, Jong-Seok
    Park, Cheol Hoon
    IEEE TRANSACTIONS ON MULTIMEDIA, 2008, 10 (05) : 767 - 779
  • [23] AUDIO-VISUAL DEEP LEARNING FOR NOISE ROBUST SPEECH RECOGNITION
    Huang, Jing
    Kingsbury, Brian
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7596 - 7599
  • [24] Robust Audio-Visual Speech Recognition Based on Hybrid Fusion
    Liu, Hong
    Li, Wenhao
    Yang, Bing
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 7580 - 7586
  • [25] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
    Zuern, Jannik
    Burgard, Wolfram
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7 (03) : 7415 - 7422
  • [26] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
    Terbouche, Hacene
    Schoneveld, Liam
    Benson, Oisin
    Othmani, Alice
    IEEE ACCESS, 2022, 10 : 41622 - 41638
  • [27] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 5671 - 5672
  • [28] Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
    Sato, Tomoya
    Sugano, Yusuke
    Sato, Yoichi
    IEEE ACCESS, 2022, 10 : 94273 - 94284
  • [29] An audio-visual speech recognition with a new mandarin audio-visual database
    Liao, Wen-Yuan
    Pao, Tsang-Long
    Chen, Yu-Te
    Chang, Tsun-Wei
    INT CONF ON CYBERNETICS AND INFORMATION TECHNOLOGIES, SYSTEMS AND APPLICATIONS/INT CONF ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 1, 2007, : 19 - +
  • [30] Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases
    Dabbabi, Karim
    Mars, Abdelkarim
    JOURNAL OF SYSTEMS SCIENCE AND SYSTEMS ENGINEERING, 2024, 33 (05) : 576 - 606