Sequence-to-Sequence Multi-Modal Speech In-Painting

Cited by: 0
Authors
Elyaderani, Mahsa Kadkhodaei [1]
Shirani, Shahram [1]
Affiliations
[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada
Source
INTERSPEECH 2023
Keywords
speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; audio; interpolation
DOI
10.21437/Interspeech.2023-1848
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech in-painting is the task of regenerating missing audio content from reliable context information. Despite various recent studies on multi-modal approaches to audio in-painting, an effective fusion of visual and auditory information for speech in-painting is still needed. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder acts as a lip-reader for the facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms in duration, which demonstrates the effectiveness of the introduced multi-modality for speech in-painting.
Pages: 829 - 833
Number of pages: 5
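
As a rough illustration only, the sketch below shows how an LSTM-based "lip-reading" encoder over visual features could condition an LSTM decoder that restores a masked spectrogram, matching the encoder-decoder description in the abstract above. This is not the authors' implementation: the class names (LipReadingEncoder, SpeechInpaintingDecoder), layer sizes, concatenation fusion, and the assumption that video and audio frames are time-aligned are all illustrative assumptions.

    # Minimal sketch (assumed, not the paper's code) of a multi-modal
    # sequence-to-sequence speech in-painter: an LSTM encoder over lip-region
    # features conditions an LSTM decoder that restores a masked spectrogram.
    import torch
    import torch.nn as nn

    class LipReadingEncoder(nn.Module):
        """Encodes per-frame lip-region features into a visual context sequence."""
        def __init__(self, visual_dim=512, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(visual_dim, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, visual_feats):            # (B, T, visual_dim)
            context, _ = self.lstm(visual_feats)    # (B, T, 2 * hidden_dim)
            return context

    class SpeechInpaintingDecoder(nn.Module):
        """Fuses the visual context with the distorted spectrogram and
        predicts the restored spectrogram frame by frame."""
        def __init__(self, spec_dim=128, context_dim=512, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(spec_dim + context_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, spec_dim)

        def forward(self, distorted_spec, visual_context):
            # Concatenation fusion; assumes audio and video frames are time-aligned.
            fused = torch.cat([distorted_spec, visual_context], dim=-1)
            hidden, _ = self.lstm(fused)
            return self.out(hidden)                 # (B, T, spec_dim)

    if __name__ == "__main__":
        B, T = 2, 75                                # batch size, number of frames
        video = torch.randn(B, T, 512)              # pre-extracted lip features
        masked_spec = torch.randn(B, T, 128)        # spectrogram with gaps zeroed
        restored = SpeechInpaintingDecoder()(masked_spec, LipReadingEncoder()(video))
        print(restored.shape)                       # torch.Size([2, 75, 128])

Training such a sketch would minimize a reconstruction loss (e.g. L1 or MSE) between the restored and clean spectrograms over the masked regions; the paper's exact objective and fusion scheme are not reproduced here.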
Related Papers
50 records in total
  • [11] On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
    Irie, Kazuki
    Prabhavalkar, Rohit
    Kannan, Anjuli
    Bruguier, Antoine
    Rybach, David
    Nguyen, Patrick
    INTERSPEECH 2019, 2019, : 3800 - 3804
  • [12] Enhancing Sequence-to-Sequence Text-to-Speech with Morphology
    Taylor, Jason
    Richmond, Korin
    INTERSPEECH 2020, 2020, : 1738 - 1742
  • [13] A Sequence-to-Sequence Pronunciation Model for Bangla Speech Synthesis
    Ahmad, Arif
    Hussain, Mohammed Raihan
    Selim, Mohammad Reza
    Iqbal, Muhammed Zafar
    Rahman, Mohammad Shahidur
    2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018,
  • [14] Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems
    Karafiat, Martin
    Baskar, Murali Karthick
    Watanabe, Shinji
    Hori, Takaaki
    Wiesner, Matthew
    Cernocky, Jan Honza
    INTERSPEECH 2019, 2019, : 2220 - 2224
  • [15] SUPERVISED ATTENTION IN SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION
    Yang, Gene-Ping
    Tang, Hao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7222 - 7226
  • [16] Dysarthric Speech Transformer: A Sequence-to-Sequence Dysarthric Speech Recognition System
    Shahamiri, Seyed Reza
    Lal, Vanshika
    Shah, Dhvani
    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2023, 31 : 3407 - 3416
  • [17] INVESTIGATION OF AN INPUT SEQUENCE ON THAI NEURAL SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS
    Janyoi, Pongsathon
    Thangthai, Ausdang
    2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA), 2021, : 218 - 223
  • [18] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [19] Sequence-to-Sequence Models Can Directly Translate Foreign Speech
    Weiss, Ron J.
    Chorowski, Jan
    Jaitly, Navdeep
    Wu, Yonghui
    Chen, Zhifeng
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2625 - 2629
  • [20] High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
    Thai-Son Nguyen
    Ngoc-Quan Pham
    Stueker, Sebastian
    Waibel, Alex
    INTERSPEECH 2020, 2020, : 2147 - 2151