Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Authors
Shixing Han
Jin Liu
Jinyingming Zhang
Peizhu Gong
Xiliang Zhang
Huihua He
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] Shanghai Normal University, College of Early Childhood Education
Keywords
Dense video captioning; Cross-modal attention; Commonsense reasoning; Heterogeneous knowledge; Unbiased scene graph
Abstract
Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate on visual features alone and neglect the audio information in the video, which leads to inaccurate event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM uses a cross-modal attention mechanism to encode data from different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events, and a shared encoder is used to reduce model redundancy. CR improves the logic of generated captions using both heterogeneous prior knowledge and reasoning over entity associations, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments on the ActivityNet Captions dataset demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance of CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
Pages: 4995-5012
Page count: 17