Object-Centric Representation Learning for Video Question Answering

Cited by: 1
Authors
Long Hoang Dang [1]
Thao Minh Le [1]
Vuong Le [1]
Truyen Tran [1]
Affiliations
[1] Deakin Univ, Appl Artificial Intelligence Inst, Burwood, Australia
DOI
10.1109/IJCNN52387.2021.9533961
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities to integrate video processing, language understanding, binding abstract linguistic concepts to concrete visual artifacts, and deliberative reasoning over space-time. Neural networks offer a promising approach to reach this potential through learning from examples rather than handcrafting features and rules. However, neural networks are predominantly feature-based: they map data to unstructured vectorial representations and can thus fall into the trap of exploiting shortcuts through surface statistics instead of the true systematic reasoning seen in symbolic systems. To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition and high-level symbolic algebra. To this end, we propose a new query-guided representation framework to turn a video into an evolving relational graph of objects, whose features and interactions are dynamically and conditionally inferred. The object lives are then summarized into resumes, lending themselves naturally to deliberative relational reasoning that produces an answer to the query. The framework is evaluated on major Video QA datasets, demonstrating clear benefits of the object-centric approach to video reasoning.
Pages: 8
Related papers
50 in total
  • [1] Object-Centric Representation Learning for Video Scene Understanding
    Zhou, Yi
    Zhang, Hui
    Park, Seung-In
    Yoo, ByungIn
    Qi, Xiaojuan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 8410 - 8423
  • [2] OCVOS: Object-Centric Representation for Video Object Segmentation
    Jo, Junho
    Wee, Dongyoon
    Cho, Nam Ik
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 1655 - 1659
  • [3] Is an Object-Centric Video Representation Beneficial for Transfer?
    Zhang, Chuhan
    Gupta, Ankush
    Zisserman, Andrew
    COMPUTER VISION - ACCV 2022, PT IV, 2023, 13844 : 379 - 397
  • [4] Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation
    Zhou, Yi
    Zhang, Hui
    Lee, Hana
    Sun, Shuyang
    Li, Pingjun
    Zhu, Yangguang
    Yoo, ByungIn
    Qi, Xiaojuan
    Han, Jae-Joon
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3083 - 3093
  • [5] Learning Object-Centric Transformation for Video Prediction
    Chen, Xiongtao
    Wang, Wenmin
    Wang, Jinzhuo
    Li, Weimian
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1503 - 1511
  • [6] Language-Mediated, Object-Centric Representation Learning
    Wang, Ruocheng
    Mao, Jiayuan
    Gershman, Samuel J.
    Wu, Jiajun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 2033 - 2046
  • [7] Object-Centric Representation Learning from Unlabeled Videos
    Gao, Ruohan
    Jayaraman, Dinesh
    Grauman, Kristen
    COMPUTER VISION - ACCV 2016, PT V, 2017, 10115 : 248 - 263
  • [8] Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation
    Fan, Ke
    Lei, Jingshi
    Qian, Xuelin
    Yu, Miaopeng
    Xiao, Tianjun
    He, Tong
    Zhang, Zheng
    Fu, Yanwei
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 1272 - 1281
  • [9] Representation learning from videos in-the-wild: An object-centric approach
    Romijnders, Rob
    Mahendran, Aravindh
    Tschannen, Michael
    Djolonga, Josip
    Ritter, Marvin
    Houlsby, Neil
    Lucic, Mario
    2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2021), 2021, : 177 - 187
  • [10] Learning Object-Centric Dynamic Modes from Video and Emerging Properties
    Comas, Armand
    Fernandez-Lopez, Christian
    Ghimire, Sandesh
    Li, Haolin
    Sznaier, Mario
    Camps, Octavia
    LEARNING FOR DYNAMICS AND CONTROL CONFERENCE, VOL 211, 2023, 211