Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Cited by: 0
Authors
Shixing Han
Jin Liu
Jinyingming Zhang
Peizhu Gong
Xiliang Zhang
Huihua He
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] Shanghai Normal University, College of Early Childhood Education
Keywords
Dense video captioning; Cross-modal attention; Commonsense reasoning; Heterogeneous knowledge; Unbiased scene graph
DOI
Not available
Abstract
Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate on exploiting visual features while neglecting the audio information in the video, which results in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM uses a cross-modal attention mechanism to encode data from different modalities. An event refactoring algorithm is proposed to deal with the inaccurate event localization caused by overlapping events. In addition, a shared encoder is used to reduce model redundancy. CR improves the logical coherence of generated captions using both heterogeneous prior knowledge and reasoning over entity associations, which is achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments on the ActivityNet Captions dataset demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
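The abstract names a cross-modal attention mechanism but does not specify it. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in which visual frame features act as queries over audio frame features. It is not the CMCR implementation: the function names, the toy feature values, and the omission of learned projection matrices are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual, audio):
    """Let each visual frame (query) attend over audio frames (keys/values).

    visual: list of T_v feature vectors; audio: list of T_a feature vectors.
    All vectors share the same dimension d. Learned query/key/value
    projections are omitted here (an assumption, to keep the sketch short).
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in visual:
        # Scaled dot-product similarity between this query and every audio frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in audio]
        weights = softmax(scores)
        # Weighted sum of audio features gives the audio-aware representation.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, audio)) for j in range(d)]
        )
    return attended


# Hypothetical toy features: 2 visual frames, 3 audio frames, dimension 2.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_modal_attention(visual_feats, audio_feats)
```

Each row of `fused` is a convex combination of the audio vectors, weighted toward the audio frames most similar to the corresponding visual frame. A practical model would add learned projections and multiple heads, and could apply the same operation in the audio-to-visual direction.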
Pages: 4995-5012
Page count: 17