Sequence-to-Sequence Multi-Modal Speech In-Painting

Cited by: 0
Authors
Elyaderani, Mahsa Kadkhodaei [1]
Shirani, Shahram [1]
Affiliations
[1] McMaster Univ, Dept Computat Sci & Engn, Hamilton, ON, Canada
Source
INTERSPEECH 2023
Keywords
speech enhancement; speech in-painting; sequence-to-sequence models; multi-modality; Long Short-Term Memory networks; audio; interpolation
DOI
10.21437/Interspeech.2023-1848
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech in-painting is the task of regenerating missing audio content from reliable context information. Despite various recent studies on multi-modal approaches to audio in-painting, an effective fusion of visual and auditory information for speech in-painting is still needed. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder acts as a lip-reader for the facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms in duration, which demonstrates the effectiveness of the introduced multi-modality for speech in-painting.
Pages: 829 - 833
Number of pages: 5
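
As a rough illustration only, the sketch below shows how an LSTM-based "lip-reading" encoder over visual features could condition an LSTM decoder that restores a masked spectrogram, matching the encoder-decoder description in the abstract above. This is not the authors' implementation: the class names (LipReadingEncoder, SpeechInpaintingDecoder), layer sizes, concatenation fusion, and the assumption that video and audio frames are time-aligned are all illustrative assumptions.

    # Minimal sketch (assumed, not the paper's code) of a multi-modal
    # sequence-to-sequence speech in-painter: an LSTM encoder over lip-region
    # features conditions an LSTM decoder that restores a masked spectrogram.
    import torch
    import torch.nn as nn

    class LipReadingEncoder(nn.Module):
        """Encodes per-frame lip-region features into a visual context sequence."""
        def __init__(self, visual_dim=512, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(visual_dim, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, visual_feats):            # (B, T, visual_dim)
            context, _ = self.lstm(visual_feats)    # (B, T, 2 * hidden_dim)
            return context

    class SpeechInpaintingDecoder(nn.Module):
        """Fuses the visual context with the distorted spectrogram and
        predicts the restored spectrogram frame by frame."""
        def __init__(self, spec_dim=128, context_dim=512, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(spec_dim + context_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, spec_dim)

        def forward(self, distorted_spec, visual_context):
            # Concatenation fusion; assumes audio and video frames are time-aligned.
            fused = torch.cat([distorted_spec, visual_context], dim=-1)
            hidden, _ = self.lstm(fused)
            return self.out(hidden)                 # (B, T, spec_dim)

    if __name__ == "__main__":
        B, T = 2, 75                                # batch size, number of frames
        video = torch.randn(B, T, 512)              # pre-extracted lip features
        masked_spec = torch.randn(B, T, 128)        # spectrogram with gaps zeroed
        restored = SpeechInpaintingDecoder()(masked_spec, LipReadingEncoder()(video))
        print(restored.shape)                       # torch.Size([2, 75, 128])

Training such a sketch would minimize a reconstruction loss (e.g. L1 or MSE) between the restored and clean spectrograms over the masked regions; the paper's exact objective and fusion scheme are not reproduced here.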
Related Papers
50 records in total
  • [11] On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
    Irie, Kazuki
    Prabhavalkar, Rohit
    Kannan, Anjuli
    Bruguier, Antoine
    Rybach, David
    Nguyen, Patrick
    INTERSPEECH 2019, 2019, : 3800 - 3804
  • [12] Enhancing Sequence-to-Sequence Text-to-Speech with Morphology
    Taylor, Jason
    Richmond, Korin
    INTERSPEECH 2020, 2020, : 1738 - 1742
  • [13] A Sequence-to-Sequence Pronunciation Model for Bangla Speech Synthesis
    Ahmad, Arif
    Hussain, Mohammed Raihan
    Selim, Mohammad Reza
    Iqbal, Muhammed Zafar
    Rahman, Mohammad Shahidur
    2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018,
  • [14] Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems
    Karafiat, Martin
    Baskar, Murali Karthick
    Watanabe, Shinji
    Hori, Takaaki
    Wiesner, Matthew
    Cernocky, Jan Honza
    INTERSPEECH 2019, 2019, : 2220 - 2224
  • [15] SUPERVISED ATTENTION IN SEQUENCE-TO-SEQUENCE MODELS FOR SPEECH RECOGNITION
    Yang, Gene-Ping
    Tang, Hao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7222 - 7226
  • [16] Dysarthric Speech Transformer: A Sequence-to-Sequence Dysarthric Speech Recognition System
    Shahamiri, Seyed Reza
    Lal, Vanshika
    Shah, Dhvani
    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, 2023, 31 : 3407 - 3416
  • [17] INVESTIGATION OF AN INPUT SEQUENCE ON THAI NEURAL SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS
    Janyoi, Pongsathon
    Thangthai, Ausdang
    2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA), 2021, : 218 - 223
  • [18] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [19] Sequence-to-Sequence Models Can Directly Translate Foreign Speech
    Weiss, Ron J.
    Chorowski, Jan
    Jaitly, Navdeep
    Wu, Yonghui
    Chen, Zhifeng
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2625 - 2629
  • [20] High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
    Thai-Son Nguyen
    Ngoc-Quan Pham
    Stueker, Sebastian
    Waibel, Alex
    INTERSPEECH 2020, 2020, : 2147 - 2151