Robust Self-Supervised Audio-Visual Speech Recognition

Cited by: 14
Authors
Shi, Bowen [1 ]
Hsu, Wei-Ning [2 ]
Mohamed, Abdelrahman [2 ]
Affiliations
[1] Toyota Technol Inst Chicago, Chicago, IL 61801 USA
[2] Meta AI, New York, NY USA
Source
INTERSPEECH 2022
Keywords
audio-visual speech recognition; self-supervised learning; representation learning; robust speech recognition;
DOI
10.21437/Interspeech.2022-99
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup, so progress was limited by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset, LRS3, our approach outperforms the prior state of the art by ~50% (28.0% vs. 14.1% WER) using less than 10% of the labeled data (433 hr vs. 30 hr) in the presence of babble noise, while reducing the WER of an audio-only model by over 75% (25.8% vs. 5.8%) on average.
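The abstract evaluates WER under additive babble noise. As a minimal sketch (not taken from the paper), the snippet below shows the standard way such a noisy test condition is constructed: a babble noise waveform is scaled and added to the clean speech at a target signal-to-noise ratio before the audio is fed to the recognizer. The function name and exact mixing procedure are illustrative assumptions, not the authors' code.

    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Additively mix `noise` into `clean` so the mixture has the requested SNR in dB."""
        # Loop or trim the noise so it matches the length of the clean utterance.
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[: len(clean)]

        clean_power = np.mean(clean ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12

        # Scale the noise so that clean_power / noise_power equals the target SNR.
        target_noise_power = clean_power / (10 ** (snr_db / 10))
        noise = noise * np.sqrt(target_noise_power / noise_power)
        return clean + noise

    # Example usage (illustrative): noisy = mix_at_snr(clean_wav, babble_wav, snr_db=0.0)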
Pages: 2118-2122
Page count: 5
Related Papers
50 records in total
  • [41] Audio-Visual Speech Recognition in Noisy Audio Environments
    Palecek, Karel
    Chaloupka, Josef
    2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
  • [42] Robust Audio-Visual Contrastive Learning for Proposal-Based Self-Supervised Sound Source Localization in Videos
    Xuan, Hanyu
    Wu, Zhiliang
    Yang, Jian
    Jiang, Bo
    Luo, Lei
    Alameda-Pineda, Xavier
    Yan, Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (07) : 4896 - 4907
  • [43] Audio-Visual Speech Modeling for Continuous Speech Recognition
    Dupont, Stephane
    Luettin, Juergen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 141 - 151
  • [44] Incorporating Visual Information in Audio Based Self-Supervised Speaker Recognition
    Cai, Danwei
    Wang, Weiqing
    Li, Ming
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1422 - 1435
  • [45] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
    Anwar, Mohamed
    Shi, Bowen
    Goswami, Vedanuj
    Hsu, Wei-Ning
    Pino, Juan
    Wang, Changhan
    INTERSPEECH 2023, 2023, : 4064 - 4068
  • [46] A ROBUST AUDIO-VISUAL SPEECH ENHANCEMENT MODEL
    Wang, Wupeng
    Xing, Chao
    Wang, Dong
    Chen, Xiao
    Sun, Fengyu
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7529 - 7533
  • [47] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [48] A coupled HMM for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Xiaoxiang, L
    Mao, C
    Murphy, K
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2013 - 2016
  • [49] An asynchronous DBN for audio-visual speech recognition
    Saenko, Kate
    Livescu, Karen
    2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
  • [50] Audio-visual modeling for bimodal speech recognition
    Kaynak, MN
    Zhi, Q
    Cheok, AD
    Sengupta, K
    Chung, KC
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186