More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 9

Authors:

Hassid, Michael [1]

Ramanovich, Michelle Tadmor [1]

Shillingford, Brendan [2]

Wang, Miaosen [2]

Jia, Ye [1]

Remez, Tal [1]

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

[2] DeepMind, London, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.01033

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS to, unlike plain TTS models, generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground-truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented on the project page.

Pages: 10577-10587

Page count: 11
