Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Cited by: 5
Authors
Oneata, Dan [1]
Cucu, Horia [1]
Affiliations
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
Keywords
MODEL
DOI
10.1109/CVPRW56347.2022.00504
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by fine-tuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to those used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR significantly improves over the previous state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains from including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and has the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
Pages: 4578 - 4587
Number of pages: 10
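
To make the speech augmentation idea from the abstract concrete, the sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram in Python/NumPy. The abstract does not name a specific augmentation method, so the choice of SpecAugment-style masking, the masking widths, and the helper name spec_augment are illustrative assumptions rather than details taken from the paper.

import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_width=15,
                 num_time_masks=2, time_width=50, rng=None):
    # Illustrative SpecAugment-style masking (assumed, not taken from the paper):
    # zero out random bands of mel channels and random spans of time frames in a
    # log-mel spectrogram of shape (num_mel_bins, num_frames).
    rng = rng if rng is not None else np.random.default_rng()
    aug = log_mel.copy()
    n_mels, n_frames = aug.shape

    # Frequency masking: hide a few randomly chosen mel-channel bands.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width)))
        aug[start:start + width, :] = 0.0

    # Time masking: hide a few randomly chosen spans of frames, removing
    # acoustic evidence so a multimodal model has to lean on the visual stream.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width)))
        aug[:, start:start + width] = 0.0

    return aug

# Usage example on a dummy 80-bin, 300-frame log-mel spectrogram.
features = np.random.randn(80, 300).astype(np.float32)
augmented = spec_augment(features)

In the setting described by the abstract, masking parts of the audio in this way plays the role previously filled by word masking: with some acoustic evidence removed, the multimodal system is encouraged to attend to the visual stimuli.
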
Related Papers
50 records in total
  • [21] You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
    Laptev, Aleksandr
    Korostik, Roman
    Svischev, Aleksey
    Andrusenko, Andrei
    Medennikov, Ivan
    Rybin, Sergey
    2020 13TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI 2020), 2020, : 439 - 444
  • [22] SPEECH AUGMENTATION USING WAVENET IN SPEECH RECOGNITION
    Wang, Jisung
    Kim, Sangki
    Lee, Yeha
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6770 - 6774
  • [23] IMPROVING SEQUENCE-TO-SEQUENCE SPEECH RECOGNITION TRAINING WITH ON-THE-FLY DATA AUGMENTATION
    Nguyen, Thai-Son
    Stuker, Sebastian
    Niehues, Jan
    Waibel, Alex
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7689 - 7693
  • [24] Improving Children's Speech Recognition through Out-of-Domain Data Augmentation
    Fainberg, Joachim
    Bell, Peter
    Lincoln, Mike
    Renals, Steve
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1598 - 1602
  • [25] Data Augmentation for Improving Explainability of Hate Speech Detection
    Ansari, Gunjan
    Kaur, Parmeet
    Saxena, Chandni
    Arabian Journal for Science and Engineering, 2024, 49 : 3609 - 3621
  • [26] Data Augmentation for Improving Explainability of Hate Speech Detection
    Ansari, Gunjan
    Kaur, Parmeet
    Saxena, Chandni
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024, 49 (03) : 3609 - 3621
  • [27] Adaptive data augmentation for Mandarin automatic speech recognition
    Ding, Kai
    Li, Ruixuan
    Xu, Yuelin
    Du, Xingyue
    Deng, Bin
    APPLIED INTELLIGENCE, 2024, 54 (07) : 5674 - 5687
  • [28] Adversarial Data Augmentation Network for Speech Emotion Recognition
    Yi, Lu
    Mak, Man-Wai
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 529 - 534
  • [29] Investigation of Data Augmentation Techniques for Disordered Speech Recognition
    Geng, Mengzhe
    Xie, Xurong
    Liu, Shansong
    Yu, Jianwei
    Hu, Shoukang
    Liu, Xunying
    Meng, Helen
    INTERSPEECH 2020, 2020, : 696 - 700
  • [30] Data Augmentation using GANs for Speech Emotion Recognition
    Chatziagapi, Aggelina
    Paraskevopoulos, Georgios
    Sgouropoulos, Dimitris
    Pantazopoulos, Georgios
    Nikandrou, Malvina
    Giannakopoulos, Theodoros
    Katsamanis, Athanasios
    Potamianos, Alexandros
    Narayanan, Shrikanth
    INTERSPEECH 2019, 2019, : 171 - 175