Object-Centric Representation Learning for Video Question Answering

Cited by: 1
Authors
Long Hoang Dang [1 ]
Thao Minh Le [1 ]
Vuong Le [1 ]
Truyen Tran [1 ]
Affiliations
[1] Deakin Univ, Appl Artificial Intelligence Inst, Burwood, Australia
Keywords
DOI
10.1109/IJCNN52387.2021.9533961
CLC classification
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors. The task demands new capabilities to integrate video processing, language understanding, binding of abstract linguistic concepts to concrete visual artifacts, and deliberative reasoning over space-time. Neural networks offer a promising approach to reach this potential through learning from examples rather than handcrafting features and rules. However, neural networks are predominantly feature-based - they map data to unstructured vectorial representations and thus can fall into the trap of exploiting shortcuts through surface statistics instead of the true systematic reasoning seen in symbolic systems. To tackle this issue, we advocate for object-centric representation as a basis for constructing spatio-temporal structures from videos, essentially bridging the semantic gap between low-level pattern recognition and high-level symbolic algebra. To this end, we propose a new query-guided representation framework to turn a video into an evolving relational graph of objects, whose features and interactions are dynamically and conditionally inferred. The object lives are then summarized into resumes, lending themselves naturally to deliberative relational reasoning that produces an answer to the query. The framework is evaluated on major Video QA datasets, demonstrating clear benefits of the object-centric approach to video reasoning.
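The pipeline the abstract describes — object tracks through time, query-conditioned summarization into "resumes," then relational reasoning over those resumes — can be sketched at a toy scale. This is a minimal illustrative sketch, not the paper's model: the names (`ObjectTrack`, `summarize_resume`, `relational_reasoning`) are hypothetical, and the hand-coded dot-product attention stands in for the learned neural modules the actual framework would use.

```python
# Toy sketch of a query-guided object-centric Video QA pipeline.
# All component names are hypothetical; real models learn these mappings.
from dataclasses import dataclass
from math import exp

@dataclass
class ObjectTrack:
    """One object's 'life': a feature vector per frame."""
    object_id: int
    features: list  # list of per-frame feature vectors

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def summarize_resume(track, query):
    """Query-conditioned temporal pooling: frames relevant to the
    query dominate the object's summary ('resume')."""
    weights = softmax([dot(f, query) for f in track.features])
    dim = len(track.features[0])
    return [sum(w * f[d] for w, f in zip(weights, track.features))
            for d in range(dim)]

def relational_reasoning(resumes, query):
    """Score each resume against the query and aggregate; a stand-in
    for graph-based relational reasoning over object nodes."""
    scores = softmax([dot(r, query) for r in resumes])
    dim = len(resumes[0])
    return [sum(s * r[d] for s, r in zip(scores, resumes))
            for d in range(dim)]

# Two toy object tracks over three frames, with 2-d features.
tracks = [
    ObjectTrack(0, [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]),
    ObjectTrack(1, [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]),
]
query = [1.0, 0.0]  # query embedding aligned with object 0's appearance
resumes = [summarize_resume(t, query) for t in tracks]
answer_context = relational_reasoning(resumes, query)
```

Because the query aligns with object 0's features, its resume dominates the pooled answer context; in the real framework this conditioning is learned end-to-end rather than fixed.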
Pages: 8