Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Cited by: 0

Authors:

Shixing Han

Jin Liu

Jinyingming Zhang

Peizhu Gong

Xiliang Zhang

Huihua He

Affiliations:

[1] Shanghai Maritime University, College of Information Engineering

[2] Shanghai Normal University, College of Early Childhood Education

Source:

Complex & Intelligent Systems | 2023 / Vol. 9

Keywords:

Dense video captioning; Cross-modal attention; Commonsense reasoning; Heterogeneous knowledge; Unbiased scene graph

DOI:

Not available


Abstract:

Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite notable progress on this task, previous works usually concentrate on exploiting visual features while neglecting the audio information in the video, which results in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM uses a cross-modal attention mechanism to encode data from different modalities. An event refactoring algorithm is proposed to deal with the inaccurate event localization caused by overlapping events. In addition, a shared encoder is used to reduce model redundancy. CR improves the logic of the generated captions using both heterogeneous prior knowledge and reasoning over entity associations, which is achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments on the ActivityNet Captions dataset demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.


Pages: 4995 - 5012

Page count: 17

50 records in total

[1] Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
Han, Shixing
Liu, Jin
Zhang, Jinyingming
Gong, Peizhu
Zhang, Xiliang
He, Huihua
COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (05) : 4995 - 5012
[2] Cross-Modal Graph With Meta Concepts for Video Captioning
Wang, Hao
Lin, Guosheng
Hoi, Steven C. H.
Miao, Chunyan
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 5150 - 5162
[3] Knowledge-Enhanced Context Representation for Unbiased Scene Graph Generation
Wang, Yuanlong
Liu, Zhenqi
Zhang, Hu
Li, Ru
WEB AND BIG DATA, APWEB-WAIM 2024, PT I, 2024, 14961 : 248 - 263
[4] Weakly-supervised Video Scene Graph Generation via Unbiased Cross-modal Learning
Wu, Ziyue
Gao, Junyu
Xu, Changsheng
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 4574 - 4583
[5] Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching
Wu, Bofeng
Niu, Guocheng
Yu, Jun
Xiao, Xinyan
Zhang, Jian
Wu, Hua
PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021 : 1157 - 1164
[6] Text-Enhanced Graph Attention Hashing for Cross-Modal Retrieval
Zou, Qiang
Cheng, Shuli
Du, Anyu
Chen, Jiayi
ENTROPY, 2024, 26 (11)
[7] Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
Muksimova, Shakhnoza
Umirzakova, Sabina
Sultanov, Murodjon
Cho, Young Im
SENSORS, 2025, 25 (03)
[8] Knowledge-Enhanced Graph Attention Network for Fact Verification
Chen, Chonghao
Zheng, Jianming
Chen, Honghui
MATHEMATICS, 2021, 9 (16)
[9] Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks
Wang, Jia
Ke, Jingcheng
Shuai, Hong-Han
Li, Yung-Hui
Cheng, Wen-Huang
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
[10] Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
Jin, Tao
Huang, Siyu
Li, Yingming
Zhang, Zhongfei
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019 : 2001 - 2011
