Object-Centric Representation Learning for Video Scene Understanding

Cited by: 0
Authors
Zhou, Yi [1]
Zhang, Hui [1]
Park, Seung-In [2]
Yoo, ByungIn [2]
Qi, Xiaojuan [3]
Affiliations
[1] Samsung R&D Inst China Beijing SRC B, Beijing 100028, Peoples R China
[2] Samsung Adv Inst Technol, Suwon 446712, South Korea
[3] Univ Hong Kong, Dept Elect & Elect Engn, Hong Kong, Peoples R China
Keywords
Semantics; Task analysis; IP networks; Feature extraction; Pipelines; Estimation; Generators; Scene understanding; video panoptic segmentation; depth estimation; tracking; object-centric representation
DOI
10.1109/TPAMI.2024.3401409
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Depth-aware Video Panoptic Segmentation (DVPS) is a challenging task that requires predicting the semantic class and 3D depth of each pixel in a video, while also segmenting and consistently tracking objects across frames. Predominant methodologies treat this as a multi-task learning problem, tackling each constituent task independently, thus restricting their capacity to leverage interrelationships amongst tasks and requiring parameter tuning for each task. To surmount these constraints, we present Slot-IVPS, a new approach employing an object-centric model to acquire unified object representations, thereby facilitating the model's ability to simultaneously capture semantic and depth information. Specifically, we introduce a novel representation, Integrated Panoptic Slots (IPS), to capture both semantic and depth information for all panoptic objects within a video, encompassing background semantics and foreground instances. Subsequently, we propose an integrated feature generator and enhancer to extract depth-aware features, alongside the Integrated Video Panoptic Retriever (IVPR), which iteratively retrieves spatial-temporal coherent object features and encodes them into IPS. The resulting IPS can be effortlessly decoded into an array of video outputs, including depth maps, classifications, masks, and object instance IDs. We undertake comprehensive analyses across four datasets, attaining state-of-the-art performance in both Depth-aware Video Panoptic Segmentation and Video Panoptic Segmentation tasks.
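The iterative retrieval idea in the abstract can be illustrated with a generic slot-attention-style update. This is a minimal sketch under stated assumptions, not the Slot-IVPS implementation: the function name `retrieve_slots`, the tensor sizes, and the random inputs are all hypothetical, and the learned projections, depth branch, and decoding heads a real model would need are omitted. It only shows how a fixed set of slots can repeatedly attend over features flattened across frames, so that one slot index refers to the same object in every frame.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_slots(features, slots, num_iters=3, eps=1e-8):
    """Hypothetical helper: refine K slots against N feature vectors.

    features: (N, D) features flattened over frames and pixels.
    slots:    (K, D) initial slot vectors.
    Returns the refined slots (K, D) and the last attention map (N, K).
    """
    d = features.shape[1]
    for _ in range(num_iters):
        # Similarity between every location and every slot.
        logits = features @ slots.T / np.sqrt(d)          # (N, K)
        # Softmax over slots: slots compete for each location.
        attn = softmax(logits, axis=1)                    # (N, K)
        # Normalize per slot so each update is a weighted mean of features.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = weights.T @ features                      # (K, D)
    return slots, attn

# Toy input (assumed sizes): 2 frames of 4x4 locations, 8 channels, 3 slots.
rng = np.random.default_rng(0)
T, H, W, D, K = 2, 4, 4, 8, 3
features = rng.normal(size=(T * H * W, D))
slots = rng.normal(size=(K, D))

slots, attn = retrieve_slots(features, slots)
# Assign each location to its strongest slot; the slot index acts as a
# video-level ID because the same slots are shared across both frames.
masks = attn.argmax(axis=1).reshape(T, H, W)
```

Because the features from all frames are flattened into one set before attention, the per-location slot assignment doubles as a simple cross-frame identity, which is the intuition behind decoding masks and instance IDs from a shared set of slots.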
Pages: 8410-8423
Number of pages: 14