SceneScape: Text-Driven Consistent Scene Generation

Authors
Fridman, Rafail [1]
Abecasis, Amit [1]
Kasten, Yoni [2]
Dekel, Tali [1]
Affiliations
[1] Weizmann Inst Sci, Rehovot, Israel
[2] NVIDIA Res, Santa Clara, CA USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a method for text-driven perpetual view generation - synthesizing long-term videos of various scenes solely from an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically-plausible scenes, we deploy an online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is progressively constructed along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles. Project page: https://scenescape.github.io/
Pages: 18