More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 9
Authors
Hassid, Michael [1]
Ramanovich, Michelle Tadmor [1]
Shillingford, Brendan [2]
Wang, Miaosen [2]
Jia, Ye [1]
Remez, Tal [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] DeepMind, London, England
DOI
10.1109/CVPR52688.2022.01033
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS to, unlike plain TTS models, generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented at the project page.
Pages: 10577-10587
Page count: 11
Related papers
50 in total
  • [31] Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
    Zhang, Guangyan
    Merritt, Thomas
    Ribeiro, Manuel Sam
    Tura-Vecino, Biel
    Yanagisawa, Kayoko
    Pokora, Kamil
    Ezzerg, Abdelhamid
    Cygert, Sebastian
    Abbas, Ammar
    Bilinski, Piotr
    Barra-Chicote, Roberto
    Korzekwa, Daniel
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2023, 2023: 27-31
  • [32] Chinese Prosody Generation Based on C-ToBI Representation for Text-To-Speech
    Kim, Byeongchang
    ADVANCES IN COMPUTER SCIENCE AND INFORMATION TECHNOLOGY, PROCEEDINGS, 2010, 6059: 558-571
  • [33] Repetition and re-start strategies for prosody in text-to-speech conversion systems
    Laver, J
    SPEECH COMMUNICATION, 1993, 13 (1-2): 75-85
  • [34] An evaluation of Mongolian data-driven text-to-speech
    Altangerel, Chagnaa
    Purev, Jaimai
    Yesyenbyek, Kerey
    Hansakunbuntheung, Chatchawarn
    2013 INTERNATIONAL CONFERENCE ORIENTAL COCOSDA HELD JOINTLY WITH 2013 CONFERENCE ON ASIAN SPOKEN LANGUAGE RESEARCH AND EVALUATION (O-COCOSDA/CASLRE), 2013
  • [35] ProsoSpeech: Enhancing prosody with quantized vector pre-training in text-to-speech
    Ren, Yi
    Lei, Ming
    Huang, Zhiying
    Zhang, Shiliang
    Chen, Qian
    Yan, Zhijie
    Zhao, Zhou
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 7577-7581
  • [36] More than words: Some reflections on working visually
    Halford, S
    Knowles, C
    SOCIOLOGICAL RESEARCH ONLINE, 2005, 10 (01)
  • [37] Automatic conversion from lexical words to prosodic words for Mandarin text-to-speech system
    Shao, Yanqiu
    Han, Jiqing
    Liu, Ting
    Zhao, Yongzhen
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2007, 10 (01): 45-55
  • [38] The Information Structure-prosody interface in text-to-speech technologies. An empirical perspective
    Dominguez, Monica
    Farrus, Mireia
    Wanner, Leo
    CORPUS LINGUISTICS AND LINGUISTIC THEORY, 2022, 18 (02): 419-445
  • [39] A novel prosody adaptation method for Mandarin concatenation-based text-to-speech system
    Yu, Jian
    Tao, Jianhua
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2009, 30 (01): 33-41
  • [40] Recent improvements of probability based prosody models for unit selection in concatenative text-to-speech
    Zhang, Wei
    Gu, Liang
    Gao, Yuqing
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS, 2009: 3777-3780