Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Cited by: 0
Authors
Fang, Zhiyuan [1 ]
Gokhale, Tejas [1 ]
Banerjee, Pratyay [1 ]
Baral, Chitta [1 ]
Yang, Yezhou [1 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85287 USA
Keywords
DOI: Not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset "Video-to-Commonsense (V2C)" that contains ∼9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
Pages: 840-860
Page count: 21
Related Papers
32 in total
  • [1] CAVAN: Commonsense Knowledge Anchored Video Captioning
    Shao, Huiliang
    Fang, Zhiyuan
    Yang, Yezhou
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 4095 - 4102
  • [2] Joint Commonsense and Relation Reasoning for Image and Video Captioning
    Hou, Jingyi
    Wu, Xinxiao
    Zhang, Xiaoxun
    Qi, Yayun
    Jia, Yunde
    Luo, Jiebo
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 10973 - 10980
  • [3] Visual Commonsense-Aware Representation Network for Video Captioning
    Zeng, Pengpeng
    Zhang, Haonan
    Gao, Lianli
    Li, Xiangpeng
    Qian, Jin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (01) : 1092 - 1103
  • [4] Hybrid Reasoning Network for Video-based Commonsense Captioning
    Yu, Weijiang
    Liang, Jian
    Ji, Lei
    Li, Lu
    Fang, Yuejian
    Xiao, Nong
    Duan, Nan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5213 - 5221
  • [5] Implicit and explicit commonsense for multi-sentence video captioning
    Chou, Shih-Han
    Little, James J.
    Sigal, Leonid
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 247
  • [6] GPT-Based Knowledge Guiding Network for Commonsense Video Captioning
    Yuan, Mengqi
    Jia, Gengyun
    Bao, Bing-Kun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 5147 - 5158
  • [7] Optimizing Video Selection LIMIT Queries With Commonsense Knowledge
    He, Wenjia
    Sabek, Ibrahim
    Lou, Yuze
    Cafarella, Michael
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (07): : 1751 - 1764
  • [8] PAINE Demo: Optimizing Video Selection Queries With Commonsense Knowledge
    He, Wenjia
    Sabek, Ibrahim
    Lou, Yuze
    Cafarella, Michael
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (12): : 3902 - 3905
  • [9] Utilizing a Dense Video Captioning Technique for Generating Image Descriptions of Comics for People with Visual Impairments
    Kim, Suhyun
    Lee, Semin
    Kim, Kyungok
    Oh, Uran
    PROCEEDINGS OF 2024 29TH ANNUAL CONFERENCE ON INTELLIGENT USER INTERFACES, IUI 2024, 2024, : 750 - 760