Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

Cited by: 5
Authors
Oneata, Dan [1]
Cucu, Horia [1]
Affiliations
[1] Univ Politehn Bucuresti, Speech & Dialogue Res Lab, Bucharest, Romania
Keywords
MODEL;
DOI
10.1109/CVPRW56347.2022.00504
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g., by fine-tuning pretrained image recognition networks, significantly less attention has been paid to its counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to those used for the visual encoder, namely, transferring representations and data augmentation. First, we show that starting from a pretrained ASR system significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains by including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
Pages: 4578-4587
Number of pages: 10