Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Cited by: 0
Authors
Shixing Han
Jin Liu
Jinyingming Zhang
Peizhu Gong
Xiliang Zhang
Huihua He
Affiliations
[1] Shanghai Maritime University,College of Information Engineering
[2] Shanghai Normal University,College of Early Childhood Education
Source
Complex & Intelligent Systems, 2023, 9(05): 4995-5012
Keywords
Dense video captioning; Cross-modal attention; Commonsense reasoning; Heterogeneous knowledge; Unbiased scene graph;
DOI
Not available
Abstract
Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate only on exploiting visual features while neglecting the audio information in the video, resulting in inaccurate localization of scene events. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data from different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. In addition, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of the generated captions with both heterogeneous prior knowledge and entity-association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
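The abstract does not give implementation details of the cross-modal attention mechanism, so the sketch below is only a minimal, hypothetical PyTorch illustration of how visual and audio token streams might attend to each other. All module names, dimensions, and the residual/LayerNorm arrangement are assumptions, not the authors' actual design.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative bidirectional attention between visual and audio tokens."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Each stream queries the other stream (assumed layout).
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, Tv, d) visual tokens; aud: (B, Ta, d) audio tokens.
        v_ctx, _ = self.v2a(query=vis, key=aud, value=aud)  # audio-aware visual features
        a_ctx, _ = self.a2v(query=aud, key=vis, value=vis)  # vision-aware audio features
        # Residual connection + normalization preserves each stream's identity.
        return self.norm_v(vis + v_ctx), self.norm_a(aud + a_ctx)

vis = torch.randn(2, 100, 512)  # e.g. 100 frame-level features per clip
aud = torch.randn(2, 50, 512)   # e.g. 50 audio-segment features per clip
v_out, a_out = CrossModalAttention()(vis, aud)
print(v_out.shape, a_out.shape)  # (2, 100, 512) and (2, 50, 512)
```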
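Likewise, the event refactoring algorithm is only named in the abstract. Purely as an illustrative assumption, one common way to handle overlapping temporal event proposals is to merge pairs whose temporal IoU exceeds a threshold; the `refactor_events` helper below is hypothetical, not the paper's algorithm.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def refactor_events(events, iou_thresh=0.7):
    """Greedily merge heavily overlapping (start, end) proposals."""
    merged = []
    for ev in sorted(events):  # process proposals in order of start time
        if merged and temporal_iou(merged[-1], ev) > iou_thresh:
            last = merged.pop()
            merged.append((last[0], max(last[1], ev[1])))  # fuse the pair
        else:
            merged.append(ev)
    return merged

print(refactor_events([(0.0, 10.0), (0.5, 10.5), (20.0, 30.0)]))
# -> [(0.0, 10.5), (20.0, 30.0)]
```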
Pages: 4995-5012
Page count: 17
Related Papers
50 records in total
  • [1] Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
    Han, Shixing
    Liu, Jin
    Zhang, Jinyingming
    Gong, Peizhu
    Zhang, Xiliang
    He, Huihua
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (05) : 4995 - 5012
  • [2] Cross-Modal Graph With Meta Concepts for Video Captioning
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven C. H.
    Miao, Chunyan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 5150 - 5162
  • [3] Knowledge-Enhanced Context Representation for Unbiased Scene Graph Generation
    Wang, Yuanlong
    Liu, Zhenqi
    Zhang, Hu
    Li, Ru
    WEB AND BIG DATA, APWEB-WAIM 2024, PT I, 2024, 14961 : 248 - 263
  • [4] Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning
    Wu, Ziyue
    Gao, Junyu
    Xu, Changsheng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4574 - 4583
  • [5] Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching
    Wu, Bofeng
    Niu, Guocheng
    Yu, Jun
    Xiao, Xinyan
    Zhang, Jian
    Wu, Hua
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 1157 - 1164
  • [6] Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval
    Zou, Qiang
    Cheng, Shuli
    Du, Anyu
    Chen, Jiayi
    ENTROPY, 2024, 26 (11)
  • [7] Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
    Muksimova, Shakhnoza
    Umirzakova, Sabina
    Sultanov, Murodjon
    Cho, Young Im
    SENSORS, 2025, 25 (03)
  • [8] Knowledge-Enhanced Graph Attention Network for Fact Verification
    Chen, Chonghao
    Zheng, Jianming
    Chen, Honghui
    MATHEMATICS, 2021, 9 (16)
  • [9] Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks
    Wang, Jia
    Ke, Jingcheng
    Shuai, Hong-Han
    Li, Yung-Hui
    Cheng, Wen-Huang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [10] Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
    Jin, Tao
    Huang, Siyu
    Li, Yingming
    Zhang, Zhongfei
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2001 - 2011