Leveraging Modality-Specific Representations for Audio-Visual Speech Recognition via Reinforcement Learning

被引：0

作者：

Chen, Chen ^{[1
]}

Hu, Yuchen ^{[1
]}

Zhang, Qiang ^{[2
,3
]}

Zou, Heqing ^{[1
]}

Zhu, Beier ^{[1
]}

Chng, Eng Siong ^{[1
]}

机构：

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[2] ZJU Hangzhou Global Sci & Technol Innovat Ctr, Hangzhou, Peoples R China

[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China

来源：

THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11 | 2023年

基金：

新加坡国家研究基金会;

关键词：

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on audio modality as it is much easier to recognize than video modality in clean conditions. As a result, the AVSR model underestimates the importance of visual stream in face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages the MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions. Furthermore, we demonstrate the better generality of MSRL system than other baselines when test set contains unseen noises.

引用

页码：12607 / +

页数：10

共 50 条

[21] Leveraging recent advances in deep learning for audio-Visual emotion recognition
Schoneveld, Liam
Othmani, Alice
Abdelkawy, Hazem
PATTERN RECOGNITION LETTERS, 2021, 146 : 1 - 7
[22] An audio-visual speech recognition system for testing new audio-visual databases
Pao, Tsang-Long
Liao, Wen-Yuan
VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +
[23] MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
Cheng, Xize
Jin, Tao
Huang, Rongjie
Li, Linjun
Lin, Wang
Wang, Zehan
Wang, Ye
Liu, Huadai
Yin, Aoxiong
Zhao, Zhou
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15689 - 15699
[24] Audio-Visual Speech Recognition in Noisy Audio Environments
Palecek, Karel
Chaloupka, Josef
2013 36TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2013, : 484 - 487
[25] Integrative interaction of emotional speech in audio-visual modality
Dong, Haibin
Li, Na
Fan, Lingzhong
Wei, Jianguo
Xu, Junhai
FRONTIERS IN NEUROSCIENCE, 2022, 16
[26] Audio-Visual Speech Modeling for Continuous Speech Recognition
Dupont, Stephane
Luettin, Juergen
IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) : 141 - 151
[27] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
Ma, Fei
Zhang, Wei
Li, Yang
Huang, Shao-Lun
Zhang, Lin
APPLIED SCIENCES-BASEL, 2020, 10 (20): : 1 - 23
[28] A Robust Audio-visual Speech Recognition Using Audio-visual Voice Activity Detection
Tamura, Satoshi
Ishikawa, Masato
Hashiba, Takashi
Takeuchi, Shin'ichi
Hayamizu, Satoru
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2702 - +
[29] Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition
Karbasi, Mahdie
Zeiler, Steffen
Freiwald, Jan
Kolossa, Dorothea
ADVANCES IN COMPUTATIONAL INTELLIGENCE, IWANN 2019, PT II, 2019, 11507 : 655 - 666
[30] A coupled HMM for audio-visual speech recognition
Nefian, AV
Liang, LH
Pi, XB
Xiaoxiang, L
Mao, C
Murphy, K
2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2013 - 2016

← 1 2 3 4 5 →