Multimodal Context Fusion Based Dense Video Captioning Algorithm

Cited by: 0

Authors:

Li, Meiqi [1]

Zhou, Ziwei [1]

Affiliations:

[1] Univ Sci & Technol Liaoning, Sch Comp Sci & Software Engn, Anshan 114051, Peoples R China

Source:

ENGINEERING LETTERS | 2025 / Vol. 33 / No. 04

Keywords:

Dense Video Description; Transformer; Multimodal feature fusion; Event context; SCN Decoder

DOI:

Not available

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

The core task of dense video description is to identify all events occurring in an unedited video and generate textual descriptions for these events. This has applications in fields such as assisting visually impaired individuals, generating news headlines, and enhancing human-computer interaction. However, existing dense video description models often overlook the role of textual information (e.g., road signs, subtitles) in video comprehension, as well as the contextual relationships between events, which are crucial for accurate description generation. To address these issues, this paper proposes a multimodal dense video description approach based on event-context fusion. The model utilizes a C3D network to extract visual features from the video and integrates OCR technology to extract textual information, thereby enhancing the semantic understanding of the video content. During feature extraction, sliding window and temporal alignment techniques are applied to ensure the temporal consistency of visual, audio, and textual features. A multimodal context fusion encoder is used to capture the temporal and semantic relationships between events and to deeply integrate multimodal features. The SCN decoder then generates descriptions word by word, improving both semantic consistency and fluency. The model is trained and evaluated on the MSVD and MSR-VTT datasets, and its performance is compared with several popular models. Experimental results show significant improvements in CIDEr evaluation scores, achieving 98.8 and 53.7 on the two datasets, respectively. Additionally, ablation studies are conducted to comprehensively assess the effectiveness and stability of each component of the model.
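The abstract mentions sliding-window and temporal-alignment steps but gives no details. The following is a minimal sketch of one way such a step could look, not the authors' implementation: it cuts a frame sequence into fixed-size overlapping windows and attaches timestamped OCR strings to every window whose time span contains them. The function names, the window size, the stride, the frame rate, and the OCR sample data are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges produced by a sliding window.

    Hypothetical helper: the paper does not state its window size or stride.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    # Last start index that still yields a full window (0 for very short videos).
    last_start = max(num_frames - window, 0)
    return [(s, min(s + window, num_frames)) for s in range(0, last_start + 1, stride)]


def align_text_to_windows(
    windows: List[Tuple[int, int]],
    fps: float,
    ocr_hits: List[Tuple[float, str]],
) -> List[List[str]]:
    """For each window, collect the OCR strings whose timestamp (seconds) falls inside it.

    Assumes OCR results arrive as (timestamp_seconds, text) pairs; this input
    format is an assumption made for the sketch.
    """
    aligned = []
    for start, end in windows:
        t0, t1 = start / fps, end / fps  # convert frame indices to seconds
        aligned.append([text for t, text in ocr_hits if t0 <= t < t1])
    return aligned


if __name__ == "__main__":
    # Illustrative values only: 64 frames at 8 fps, 16-frame windows, stride 8.
    windows = sliding_windows(num_frames=64, window=16, stride=8)
    ocr_hits = [(0.5, "STOP"), (2.5, "EXIT")]
    for (start, end), texts in zip(windows, align_text_to_windows(windows, 8.0, ocr_hits)):
        print(f"frames [{start}, {end}): {texts}")
```

Because the windows overlap, one OCR string can be attached to more than one window. In a full pipeline, each window's visual features (for example, from C3D) would be paired with the text collected for the same window before fusion.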

Pages: 1061-1072

Page count: 12
