More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

被引:9
|
作者
Hassid, Michael [1 ]
Ramanovich, Michelle Tadmor [1 ]
Shillingford, Brendan [2 ]
Wang, Miaosen [2 ]
Jia, Ye [1 ]
Remez, Tal [1 ]
机构
[1] Google Res, Mountain View, CA 94043 USA
[2] DeepMind, London, England
关键词
D O I
10.1109/CVPR52688.2022.01033
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes ad-vantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS to, unlike plain TTS models, generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground-truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody, presented at the project page.(1)
引用
收藏
页码:10577 / 10587
页数:11
相关论文
共 50 条
  • [21] Novel Eigenpitch-based Prosody Model for Text-to-Speech Synthesis
    Tian, Jilei
    Nurminen, Jani
    Kiss, Imre
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 313 - 316
  • [22] Prosody model in a Mandarin Text-to-Speech System based on a hierarchical approach
    Pan, NH
    Jen, WT
    Yu, SS
    Yu, MS
    Huang, SY
    Wu, MJ
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 448 - 451
  • [23] Towards including prosody in a text-to-speech system for modern standard Arabic
    Ramsay, Allan
    Mansour, Hanady
    COMPUTER SPEECH AND LANGUAGE, 2008, 22 (01): : 84 - 103
  • [24] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 962 - 969
  • [25] High-quality prosody generation in Mandarin text-to-speech system
    Guo, Qing
    Zhang, Jie
    Katae, Nobuyuki
    Yu, Hao
    Fujitsu Scientific and Technical Journal, 2010, 46 (01): : 40 - 46
  • [26] High-Quality Prosody Generation in Mandarin Text-to-Speech System
    Guo, Qing
    Zhang, Jie
    Katae, Nobuyuki
    Yu, Hao
    FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 2010, 46 (01): : 40 - 46
  • [27] Modeling stylized invariance and local variability of prosody in text-to-speech synthesis
    Chu, Min
    Zhao, Yong
    Chang, Eric
    SPEECH COMMUNICATION, 2006, 48 (06) : 716 - 726
  • [28] Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech
    Huang, Rongjie
    Zhang, Chunlei
    Ren, Yi
    Zhao, Zhou
    Yu, Dong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 8018 - 8034
  • [29] Implementation of high quality text-to-speech using words and diphones
    Shukla, SR
    Barnwell, TP
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS: VOL I: SPEECH PROCESSING 1; VOL II: SPEECH PROCESSING 2 IND TECHNOL TRACK DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS NEURALNETWORKS FOR SIGNAL PROCESSING; VOL III: IMAGE & MULTIDIMENSIONAL SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING - VOL IV: SIGNAL PROCESSING FOR COMMUNICATIONS; VOL V: SIGNAL PROCESSING EDUCATION SENSOR ARRAY & MULTICHANNEL SIGNAL PROCESSING AUDIO & ELECTROACOUSTICS; VOL VI: SIGNAL PROCESSING THEORY & METHODS STUDENT FORUM, 2001, : 4020 - 4020
  • [30] A statistical model with hierarchical structure for predicting prosody in a mandarin text-to-speech system
    Yu, MS
    Pan, NH
    JOURNAL OF THE CHINESE INSTITUTE OF ENGINEERS, 2005, 28 (03) : 385 - 399